What exactly is a connection in R?
I've read through and successfully use ?connections in R, but I really don't understand what they are.
I get that I can download a file, read and write a compressed file, and so on. That is, I understand the result of using a connection (open, do stuff, close), but I really don't understand what connections actually do, why you have to open and close them, and so on.
I'm hoping this will also help me understand how to more effectively use them (principally understand the mechanics of what is happening so I can effectively debug when something is not working).
Connections were introduced in R 1.2.0 and described by Brian Ripley in the first issue of R NEWS (now called The R Journal) of January 2001 (pages 16-17) as an abstracted interface to IO streams such as a file, url, socket, or pipe. In 2013, Simon Urbanek added a Connections.h C API which enables R packages, such as the curl package, to implement custom connection types.
One feature of connections is that you can incrementally read or write pieces of data from/to the connection using the readBin, writeBin, readLines and writeLines functions. This allows for incremental (streaming) data processing, for example when dealing with large data or network connections:
# Read the first 30 lines, 10 lines at a time
con <- url("http://jeroen.github.io/data/diamonds.json")
open(con, "r")
data1 <- readLines(con, n = 10)
data2 <- readLines(con, n = 10)
data3 <- readLines(con, n = 10)
close(con)
Same for writing, e.g. to a file:
tmp <- file(tempfile())
open(tmp, "w")
writeLines("A line", tmp)
writeLines("Another line", tmp)
close(tmp)
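The same mechanics can be tried without a network. Below is a minimal sketch, assuming only base R and a temporary file (the file contents and variable names are made up for illustration): while a connection stays open, each write appends and each read continues where the previous one stopped.

```r
# Write five lines through an explicitly opened connection, in two calls
path <- tempfile(fileext = ".txt")
out <- file(path)
open(out, "w")
writeLines(paste("line", 1:3), out)
writeLines(paste("line", 4:5), out)  # appended: the connection is still open
close(out)

# Read them back in two chunks; the open connection keeps its position
inp <- file(path)
open(inp, "r")
first <- readLines(inp, n = 2)
rest <- readLines(inp, n = 10)  # only 3 lines remain, so 3 are returned
close(inp)

print(first)
print(rest)
```

Asking for more lines than remain is not an error; readLines simply returns what is left.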
Open the connection as rb or wb to read/write binary data (called raw vectors in R):
# Read the first 3000 bytes, 1000 bytes at a time
con <- url("http://jeroen.github.io/data/diamonds.json")
open(con, "rb")
data1 <- readBin(con, raw(), n = 1000)
data2 <- readBin(con, raw(), n = 1000)
data3 <- readBin(con, raw(), n = 1000)
close(con)
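A binary round trip can be sketched the same way offline. This assumes only base R; the temporary file and the ten bytes written to it are arbitrary illustration data.

```r
# Write ten raw bytes to a temporary file in binary mode
path <- tempfile(fileext = ".bin")
out <- file(path)
open(out, "wb")
writeBin(as.raw(1:10), out)
close(out)

# Read them back in two pieces from one open connection
inp <- file(path)
open(inp, "rb")
head_bytes <- readBin(inp, raw(), n = 4)
tail_bytes <- readBin(inp, raw(), n = 100)  # n is a maximum; 6 bytes remain
close(inp)

print(head_bytes)
print(tail_bytes)
```

Here n is an upper bound on the number of items read, so the second call returns only the six bytes that are left.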
The pipe() connection is used to run a system command and pipe text to its stdin or from its stdout, as you would do with the | operator in a shell. E.g. (let's stick with the curl examples), you can run the curl command line program and pipe the output to R:
con <- pipe("curl -H 'Accept: application/json' https://jeroen.github.io/data/diamonds.json")
open(con, "r")
data1 <- readLines(con, n = 10)
data2 <- readLines(con, n = 10)
data3 <- readLines(con, n = 10)
close(con)
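If curl is not installed, the same idea can be tried with a command that exists in most shells. This is a small sketch that assumes an echo command is available on the system; the text being echoed is arbitrary.

```r
# Run a shell command and read whatever it prints to stdout
con <- pipe("echo hello from the shell")
open(con, "r")
out <- readLines(con)
close(con)

print(out)
```

The command is started when the connection is opened, and close() waits for it to finish.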
Some aspects of connections are a bit confusing: to incrementally read/write data you need to explicitly open() and close() the connection. However, readLines and writeLines automatically open and close (but not destroy!) an unopened connection. As a result, the example below will read the first 10 lines over and over again, which is not very useful:
con <- url("http://jeroen.github.io/data/diamonds.json")
data1 <- readLines(con, n = 10)
data2 <- readLines(con, n = 10)
data3 <- readLines(con, n = 10)
identical(data1, data2)
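The automatic open-and-close behaviour can be observed directly with isOpen(). The sketch below assumes only base R and uses a temporary file with made-up contents in place of the URL.

```r
path <- tempfile()
writeLines(paste("row", 1:20), path)

con <- file(path)            # created, but not opened
a <- readLines(con, n = 5)   # readLines opens, reads, then closes again
still_open <- isOpen(con)    # the connection was left closed
b <- readLines(con, n = 5)   # so this read starts from the top again
close(con)                   # destroy the connection object

print(still_open)
print(identical(a, b))
```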
Another gotcha is that the C API can both close and destroy a connection, but R only exposes a function called close(), which actually means destroy. After calling close() on a connection it is destroyed and completely useless.
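A quick way to see this is to try reusing a connection after close(). This is a minimal sketch using base R and a temporary file; the variable name failed is just for illustration.

```r
path <- tempfile()
writeLines(c("a", "b", "c"), path)

con <- file(path)
open(con, "r")
close(con)  # destroys the connection, it cannot be reopened

# Any further use of the destroyed connection raises an error
attempt <- tryCatch(readLines(con, n = 1), error = function(e) e)
failed <- inherits(attempt, "error")

print(failed)
```

To read the file again, create a fresh connection with file(path).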
To stream-process data from a connection you want to use a pattern like this:
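stream <- function(){
  con <- url("http://jeroen.github.io/data/diamonds.json")
  open(con, "r")
  on.exit(close(con))
  # readLines returns a zero-length vector once the stream is exhausted
  while(length(txt <- readLines(con, n = 10))){
    some_callback(txt)  # placeholder: replace with your own function
  }
}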
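The following is a runnable offline variant of that pattern, as a sketch only. The function name stream_file, its arguments, and the callback are hypothetical and not part of any package; a temporary file stands in for the URL.

```r
# Hypothetical helper: feed a file to a callback, chunk_size lines at a time
stream_file <- function(path, callback, chunk_size = 10) {
  con <- file(path)
  open(con, "r")
  on.exit(close(con))  # runs even if the callback throws an error
  while (length(txt <- readLines(con, n = chunk_size)) > 0) {
    callback(txt)
  }
  invisible(NULL)
}

path <- tempfile()
writeLines(paste("record", 1:25), path)

# Example callback: record how many lines each chunk contained
chunk_sizes <- integer(0)
stream_file(path, function(txt) chunk_sizes <<- c(chunk_sizes, length(txt)))

print(chunk_sizes)  # 10 10 5
```

Only one chunk is held in memory at a time, which is the point of the pattern.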
The jsonlite package relies heavily on connections to import/export ndjson data:
library(jsonlite)
library(curl)
diamonds <- stream_in(curl("https://jeroen.github.io/data/diamonds.json"))
The streaming (in batches set by the pagesize argument, 500 lines by default in current jsonlite releases) makes it fast and memory efficient:
library(nycflights13)
stream_out(flights, file(tmp <- tempfile()))
flights2 <- stream_in(file(tmp))
all.equal(flights2, as.data.frame(flights))
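The batching can be made visible by passing a handler function to stream_in, which then receives one data frame per batch. The sketch below assumes jsonlite is installed and uses the built-in iris data set; the batch size of 50 and the counter rows_seen are arbitrary choices for illustration.

```r
library(jsonlite)

# Write the 150 rows of iris as ndjson, one JSON record per line
path <- tempfile(fileext = ".json")
stream_out(iris, file(path), verbose = FALSE)

# Process the file 50 lines at a time instead of loading it all at once
rows_seen <- 0
stream_in(
  file(path),
  handler = function(df) rows_seen <<- rows_seen + nrow(df),
  pagesize = 50,
  verbose = FALSE
)

print(rows_seen)
```

With a handler, the batches are passed to the callback rather than combined into one large data frame.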
Finally, one nice feature of connections is that the garbage collector will automatically close them if you forget to do so, albeit with an annoying warning:
con <- file(system.file("DESCRIPTION"), open = "r")
rm(con)
gc()