The curl package provides bindings to the libcurl C library for R. The package supports retrieving data in-memory, downloading to disk, or streaming using the R “connection” interface. Some knowledge of curl is recommended to use this package. For a more user-friendly HTTP client, have a look at the httr package which builds on curl with HTTP specific tools and logic.
The curl package implements several interfaces to retrieve data from a URL:

- curl_fetch_memory() saves the response in memory
- curl_download() or curl_fetch_disk() writes the response to disk
- curl() or curl_fetch_stream() streams the response data
- curl_fetch_multi() (advanced) processes responses via callback functions

Each interface performs the same HTTP request; they only differ in how the response data is processed.
The curl_fetch_memory function is a blocking interface which waits for the request to complete and returns a list with all content (data, headers, status, timings) of the server response.
req <- curl_fetch_memory("https://httpbin.org/get")
str(req)
List of 6
$ url : chr "https://httpbin.org/get"
$ status_code: int 200
$ headers : raw [1:220] 48 54 54 50 ...
$ modified : POSIXct[1:1], format: NA
$ times : Named num [1:6] 0 0.000622 0.093211 0.292991 0.387018 ...
..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
$ content : raw [1:231] 7b 0a 20 20 ...
parse_headers(req$headers)
[1] "HTTP/1.1 200 OK"
[2] "Server: nginx"
[3] "Date: Fri, 21 Oct 2016 11:16:15 GMT"
[4] "Content-Type: application/json"
[5] "Content-Length: 231"
[6] "Connection: keep-alive"
[7] "Access-Control-Allow-Origin: *"
[8] "Access-Control-Allow-Credentials: true"
cat(rawToChar(req$content))
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "r/curl/jeroen"
},
"origin": "145.136.159.98",
"url": "https://httpbin.org/get"
}
The curl_fetch_memory interface is the easiest interface and the most powerful for building API clients. However, it is not suitable for downloading really large files because the entire response is buffered in memory. If you are expecting 100G of data, you probably need one of the other interfaces.
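For example, curl_fetch_disk behaves much like curl_fetch_memory but writes the body to a file rather than keeping it in memory. A minimal sketch, assuming a small test response from httpbin.org and a throwaway temporary file:

# Write the response body straight to a file on disk
tmp <- tempfile()
req <- curl_fetch_disk("https://httpbin.org/get", tmp)
print(req$status_code)
cat(readLines(tmp), sep = "\n")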
The second method is curl_download, which has been designed as a drop-in replacement for download.file in r-base. It writes the response straight to disk, which is useful for downloading (large) files.
tmp <- tempfile()
curl_download("https://httpbin.org/get", tmp)
cat(readLines(tmp), sep = "\n")
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "r/curl/jeroen"
},
"origin": "145.136.159.98",
"url": "https://httpbin.org/get"
}
The most flexible interface is the curl function, which has been designed as a drop-in replacement for base url. It creates a so-called connection object, which allows for incremental (asynchronous) reading of the response.
con <- curl("https://httpbin.org/get")
open(con)
# Get 3 lines
out <- readLines(con, n = 3)
cat(out, sep = "\n")
{
"args": {},
"headers": {
# Get 3 more lines
out <- readLines(con, n = 3)
cat(out, sep = "\n")
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
# Get remaining lines
out <- readLines(con)
close(con)
cat(out, sep = "\n")
"User-Agent": "r/curl/jeroen"
},
"origin": "145.136.159.98",
"url": "https://httpbin.org/get"
}
The example shows how to use readLines on an opened connection to read n lines at a time. Similarly, readBin is used to read n bytes at a time for stream-parsing binary data.
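As a rough sketch of the binary case, one could open the connection in binary mode and read fixed-size chunks until the stream is exhausted (the 100-byte chunk size here is arbitrary):

con <- curl("https://httpbin.org/get")
open(con, "rb")
# Keep reading raw chunks until readBin() returns an empty vector
while (length(chunk <- readBin(con, raw(), n = 100)) > 0) {
  cat("read", length(chunk), "bytes\n")
}
close(con)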
As of curl 2.0 the package provides a non-blocking interface which can perform multiple simultaneous requests. The curl_fetch_multi function adds a request to a pool and returns immediately; it does not actually perform the request.
pool <- new_pool()
cb <- function(req){cat("done:", req$url, ": HTTP:", req$status, "\n")}
curl_fetch_multi('https://www.google.com', done = cb, pool = pool)
curl_fetch_multi('https://cloud.r-project.org', done = cb, pool = pool)
curl_fetch_multi('https://httpbin.org/blabla', done = cb, pool = pool)
When we call multi_run(), all scheduled requests are performed concurrently and the callback functions get triggered as each request completes.
# This actually performs requests:
out <- multi_run(pool = pool)
done: https://cloud.r-project.org/ : HTTP: 200
done: https://www.google.nl/?gfe_rd=cr&ei=B_kJWJ_UAeiv8wfB_YDgAw : HTTP: 200
done: https://httpbin.org/blabla : HTTP: 404
print(out)
$success
[1] 3
$error
[1] 0
$pending
[1] 0
This system allows for running many concurrent non-blocking requests. However, it is quite complex and requires careful specification of handler functions.
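For instance, a sketch combining a done callback with a fail callback for connection-level errors (the unreachable host name is made up for illustration):

pool <- new_pool()
done_cb <- function(res){cat("HTTP", res$status, "for", res$url, "\n")}
fail_cb <- function(msg){cat("request failed:", msg, "\n")}
curl_fetch_multi("https://httpbin.org/get", done = done_cb, fail = fail_cb, pool = pool)
curl_fetch_multi("https://this-host-does-not-exist.example", done = done_cb, fail = fail_cb, pool = pool)
# Perform both requests; each triggers either its done or its fail callback
out <- multi_run(pool = pool)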
An HTTP request can encounter two types of errors:

- a connection failure (the request never completes)
- the request completes, but the server returns a non-success HTTP status code

The first type of error (a connection failure) will always raise an error in R for each interface. However, if the request succeeds but the server returns a non-success HTTP status code, only curl() and curl_download() will raise an error. Let's dive a little deeper into this.
The curl and curl_download functions are the safest to use because they automatically raise an error if the request was completed but the server returned a non-success (400 or higher) HTTP status. This mimics the behavior of the base functions url and download.file. Therefore we can safely write code like this:
# This is OK
curl_download('https://cran.r-project.org/CRAN_mirrors.csv', 'mirrors.csv')
mirrors <- read.csv('mirrors.csv')
unlink('mirrors.csv')
If the HTTP request was unsuccessful, R will not continue:
# Oops! A typo in the URL!
curl_download('https://cran.r-project.org/CRAN_mirrorZ.csv', 'mirrors.csv')
Error in curl_download("https://cran.r-project.org/CRAN_mirrorZ.csv", : HTTP error 404.
con <- curl('https://cran.r-project.org/CRAN_mirrorZ.csv')
open(con)
Error in open.connection(con): HTTP error 404.
When using any of the curl_fetch_* functions, it is important to realize that these do not raise an error if the request was completed but returned a non-success status code. When using curl_fetch_memory or curl_fetch_disk you need to implement such application logic yourself and check if the response was successful.
req <- curl_fetch_memory('https://cran.r-project.org/CRAN_mirrors.csv')
print(req$status_code)
[1] 200
Same for downloading to disk. If you do not check your status, you might have downloaded an error page!
# Oops a typo!
req <- curl_fetch_disk('https://cran.r-project.org/CRAN_mirrorZ.csv', 'mirrors.csv')
print(req$status_code)
[1] 404
# This is not the CSV file we were expecting!
head(readLines('mirrors.csv'))
[1] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
[2] "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\""
[3] " \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
[4] "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\" xml:lang=\"en\">"
[5] "<head>"
[6] "<title>Object not found!</title>"
unlink('mirrors.csv')
If you do want the curl_fetch_* functions to automatically raise an error, you should set the FAILONERROR option to TRUE in the handle of the request.
h <- new_handle(failonerror = TRUE)
curl_fetch_memory('https://cran.r-project.org/CRAN_mirrorZ.csv', handle = h)
Error in curl_fetch_memory("https://cran.r-project.org/CRAN_mirrorZ.csv", : HTTP response code said error
By default libcurl uses HTTP GET to issue a request to an HTTP url. To send a customized request, we first need to create and configure a curl handle object that is passed to the specific download interface.
Creating a new handle is done using new_handle. After creating a handle object, we can set the libcurl options and HTTP request headers.
h <- new_handle()
handle_setopt(h, copypostfields = "moo=moomooo")
handle_setheaders(h,
  "Content-Type" = "text/moo",
  "Cache-Control" = "no-cache",
  "User-Agent" = "A cow"
)
Use the curl_options() function to get a list of the options supported by your version of libcurl. The libcurl documentation explains what each option does. Option names are not case sensitive.
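For example, to peek at a few of the supported options and their libcurl codes (the exact list depends on your libcurl version):

# Named vector of option codes supported by this build of libcurl
opts <- curl_options()
length(opts)
head(opts)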
After the handle has been configured, it can be used with any of the download interfaces to perform the request. For example curl_fetch_memory will store the output of the request in memory:
req <- curl_fetch_memory("http://httpbin.org/post", handle = h)
cat(rawToChar(req$content))
{
"args": {},
"data": "moo=moomooo",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Cache-Control": "no-cache",
"Content-Length": "11",
"Content-Type": "text/moo",
"Host": "httpbin.org",
"User-Agent": "A cow"
},
"json": null,
"origin": "145.136.159.98",
"url": "http://httpbin.org/post"
}
Alternatively we can use curl() to read the data via a connection interface:
con <- curl("http://httpbin.org/post", handle = h)
cat(readLines(con), sep = "\n")
{
"args": {},
"data": "moo=moomooo",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Cache-Control": "no-cache",
"Content-Length": "11",
"Content-Type": "text/moo",
"Host": "httpbin.org",
"User-Agent": "A cow"
},
"json": null,
"origin": "145.136.159.98",
"url": "http://httpbin.org/post"
}
Or we can use curl_download to write the response to disk:
tmp <- tempfile()
curl_download("http://httpbin.org/post", destfile = tmp, handle = h)
cat(readLines(tmp), sep = "\n")
{
"args": {},
"data": "moo=moomooo",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Cache-Control": "no-cache",
"Content-Length": "11",
"Content-Type": "text/moo",
"Host": "httpbin.org",
"User-Agent": "A cow"
},
"json": null,
"origin": "145.136.159.98",
"url": "http://httpbin.org/post"
}
Or perform the same request with a multi pool:
curl_fetch_multi("http://httpbin.org/post", handle = h, done = function(res){
cat("Request complete! Response content:\n")
cat(rawToChar(res$content))
})
# Perform the request
out <- multi_run()
Request complete! Response content:
{
"args": {},
"data": "moo=moomooo",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Cache-Control": "no-cache",
"Content-Length": "11",
"Content-Type": "text/moo",
"Host": "httpbin.org",
"User-Agent": "A cow"
},
"json": null,
"origin": "145.136.159.98",
"url": "http://httpbin.org/post"
}
As we have already seen, curl allows for reusing a single handle for multiple requests. However, it is not always a good idea to do so. The performance overhead of creating and configuring a new handle object is usually negligible. The safest way to issue multiple requests, either to a single server or to multiple servers, is by using a separate handle for each request.
req1 <- curl_fetch_memory("https://httpbin.org/get", handle = new_handle())
req2 <- curl_fetch_memory("http://www.r-project.org", handle = new_handle())
There are two reasons why you might want to reuse a handle for multiple requests. The first one is that it will automatically keep track of cookies set by the server. This might be useful if your host requires the use of session cookies.
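A short sketch of this, using the httpbin cookie endpoint for illustration (the sessionid value is made up); handle_cookies() shows what the handle currently stores:

# The server sets a cookie; the handle remembers it for subsequent requests
h <- new_handle()
req <- curl_fetch_memory("https://httpbin.org/cookies/set?sessionid=abc123", handle = h)
handle_cookies(h)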
The other reason is to take advantage of HTTP Keep-Alive. Curl automatically maintains a pool of open HTTP connections within each handle. When using a single handle to issue many requests to the same server, curl uses existing connections when possible. This eliminates a little bit of connection overhead, although on a decent network this might not be very significant.
h <- new_handle()
system.time(curl_fetch_memory("https://api.github.com/users/ropensci", handle = h))
user system elapsed
0.018 0.001 0.394
system.time(curl_fetch_memory("https://api.github.com/users/rstudio", handle = h))
user system elapsed
0.025 0.012 0.301
The argument against reusing handles is that curl does not clean up the handle after each request. All of the options and internal fields will linger around for all future requests until explicitly reset or overwritten. This can sometimes lead to unexpected behavior.
handle_reset(h)
The handle_reset function will reset all curl options and request headers to the default values. It will not erase cookies and it will still keep the connections alive. Therefore it is good practice to call handle_reset after performing a request if you want to reuse the handle for a subsequent request. Still, it is always safer to create a fresh new handle when possible, rather than recycling old ones.
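A minimal sketch of that pattern, resetting the handle between a POST and a plain GET:

h <- new_handle()
handle_setopt(h, copypostfields = "moo=moomooo")
req1 <- curl_fetch_memory("https://httpbin.org/post", handle = h)
# Wipe the options and headers before reusing the handle for an unrelated GET
handle_reset(h)
req2 <- curl_fetch_memory("https://httpbin.org/get", handle = h)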
The handle_setform function is used to perform a multipart/form-data HTTP POST request (a.k.a. posting a form). Values can be either strings, raw vectors (for binary data), or files.
# Posting multipart
h <- new_handle()
handle_setform(h,
  foo = "blabla",
  bar = charToRaw("boeboe"),
  description = form_file(system.file("DESCRIPTION")),
  logo = form_file(file.path(Sys.getenv("R_DOC_DIR"), "html/logo.jpg"), "image/jpeg")
)
req <- curl_fetch_memory("http://httpbin.org/post", handle = h)
The form_file function is used to upload files with the form post. It has two arguments: a file path, and optionally a content-type value. If no content-type is set, curl will guess the content type of the file based on the file extension.
All of the handle_xxx functions return the handle object so that function calls can be chained using the popular pipe operators:
library(magrittr)
new_handle() %>%
handle_setopt(copypostfields = "moo=moomooo") %>%
handle_setheaders("Content-Type" = "text/moo", "Cache-Control" = "no-cache", "User-Agent" = "A cow") %>%
curl_fetch_memory(url = "http://httpbin.org/post") %$% content %>% rawToChar %>% cat
{
"args": {},
"data": "moo=moomooo",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Cache-Control": "no-cache",
"Content-Length": "11",
"Content-Type": "text/moo",
"Host": "httpbin.org",
"User-Agent": "A cow"
},
"json": null,
"origin": "145.136.159.98",
"url": "http://httpbin.org/post"
}