Read list of file names from web into R

I am trying to read a lot of csv files into R from a website. There are multiple years of daily (business days only) files. All of the files have the same data structure. I can successfully read one file using the following logic:

# enter user credentials
user     <- "JohnDoe"
password <- "SecretPassword"
credentials <- paste(user,":",password,"@",sep="")
web.site <- "downloads.theice.com/Settlement_Reports_CSV/Power/"

# construct path to data
path <- paste("https://", credentials, web.site, sep="")

# read data for 4/10/2013
file  <- "icecleared_power_2013_04_10"
fname <- paste(path,file,".dat",sep="")
df <- read.csv(fname,header=TRUE, sep="|",as.is=TRUE)

However, I'm looking for tips on how to read all the files in the directory at once. I suppose I could generate a sequence of dates, construct the file name above in a loop, and use rbind to append each file, but that seems cumbersome. Plus, there will be issues when attempting to read weekends and holidays, where there are no files.

The images below show what the list of files looks like in the web browser:

[screenshot: file list in browser, part 1]

[screenshot: file list in browser, part 2]

Is there a way to scan the path (from above) to get a list of all the file names in the directory that meet a certain criterion (i.e., start with "icecleared_power_", since there are also some files in that location with a different starting name that I do not want to read in), then loop read.csv through that list and use rbind to append?

Any guidance would be greatly appreciated.


I would first try to just scrape the links to the relevant data files and use the resulting information to construct the full download path that includes user logins and so on. As others have suggested, lapply would be convenient for batch downloading.

Here's an easy way to extract the URLs. Obviously, modify the example to suit your actual scenario.

Here, we're going to use the XML package to identify all the links available at the CRAN archives for the Amelia package (http://cran.r-project.org/src/contrib/Archive/Amelia/).

> library(XML)
> url <- "http://cran.r-project.org/src/contrib/Archive/Amelia/"
> doc <- htmlParse(url)
> links <- xpathSApply(doc, "//a/@href")
> free(doc)
> links
                   href                    href                    href 
             "?C=N;O=D"              "?C=M;O=A"              "?C=S;O=A" 
                   href                    href                    href 
             "?C=D;O=A" "/src/contrib/Archive/"  "Amelia_1.1-23.tar.gz" 
                   href                    href                    href 
 "Amelia_1.1-29.tar.gz"  "Amelia_1.1-30.tar.gz"  "Amelia_1.1-32.tar.gz" 
                   href                    href                    href 
 "Amelia_1.1-33.tar.gz"   "Amelia_1.2-0.tar.gz"   "Amelia_1.2-1.tar.gz" 
                   href                    href                    href 
  "Amelia_1.2-2.tar.gz"   "Amelia_1.2-9.tar.gz"  "Amelia_1.2-12.tar.gz" 
                   href                    href                    href 
 "Amelia_1.2-13.tar.gz"  "Amelia_1.2-14.tar.gz"  "Amelia_1.2-15.tar.gz" 
                   href                    href                    href 
 "Amelia_1.2-16.tar.gz"  "Amelia_1.2-17.tar.gz"  "Amelia_1.2-18.tar.gz" 
                   href                    href                    href 
  "Amelia_1.5-4.tar.gz"   "Amelia_1.5-5.tar.gz"   "Amelia_1.6.1.tar.gz" 
                   href                    href                    href 
  "Amelia_1.6.3.tar.gz"   "Amelia_1.6.4.tar.gz"     "Amelia_1.7.tar.gz" 

For the sake of demonstration, imagine that, ultimately, we only want the links for the 1.2 versions of the package.

> wanted <- links[grepl("Amelia_1\\.2.*", links)]
> wanted
                  href                   href                   href 
 "Amelia_1.2-0.tar.gz"  "Amelia_1.2-1.tar.gz"  "Amelia_1.2-2.tar.gz" 
                  href                   href                   href 
 "Amelia_1.2-9.tar.gz" "Amelia_1.2-12.tar.gz" "Amelia_1.2-13.tar.gz" 
                  href                   href                   href 
"Amelia_1.2-14.tar.gz" "Amelia_1.2-15.tar.gz" "Amelia_1.2-16.tar.gz" 
                  href                   href 
"Amelia_1.2-17.tar.gz" "Amelia_1.2-18.tar.gz" 

You can now use that vector as follows:

wanted <- links[grepl("Amelia_1\\.2.*", links)]
GetMe <- paste(url, wanted, sep = "")
lapply(seq_along(GetMe), 
       function(x) download.file(GetMe[x], wanted[x], mode = "wb"))
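
Applied to your situation, the same link-scraping idea might look roughly like the sketch below. This is only a sketch: it assumes the ICE directory listing is a plain HTML index of links, and since htmlParse may not fetch an https page with embedded credentials directly, the listing is pulled down with RCurl first (userpwd takes a "user:password" string). It reuses user, password, web.site, and path from your question.

library(RCurl)
library(XML)

# fetch the directory listing with your credentials, then parse it
listing <- getURL(paste0("https://", web.site),
                  userpwd = paste(user, password, sep = ":"))
doc   <- htmlParse(listing, asText = TRUE)
links <- xpathSApply(doc, "//a/@href")
free(doc)

# keep only the settlement files you care about
wanted <- links[grepl("^icecleared_power_", links)]

# full download paths (credentials embedded via path, as in your question)
GetMe <- paste(path, wanted, sep = "")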

Update (to address your question in the comments)

The last step in the example above downloads the specified files to your current working directory (use getwd() to verify where that is). If, instead, you know for sure that read.csv works on the data, you can also try to modify your anonymous function to read the files directly:

lapply(seq_along(GetMe), 
       function(x) read.csv(GetMe[x], header = TRUE, sep = "|", as.is = TRUE))

However, I think a safer approach might be to download all the files into a single directory first, and then use read.delim or read.csv or whatever works to read in the data, similar to what was suggested by @Andreas. I say safer because it gives you more flexibility in case files aren't fully downloaded and so on. In that case, instead of having to re-download everything, you would only need to download the files that were not fully downloaded.
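
For instance, here is a rough sketch of that download-then-read workflow, reusing GetMe and wanted from above (the "ice_data" folder name is just an example):

# download into a local folder, skipping files that are already present
dir.create("ice_data", showWarnings = FALSE)
dest <- file.path("ice_data", wanted)
todo <- !file.exists(dest)
mapply(function(u, d) try(download.file(u, d, mode = "wb")),
       GetMe[todo], dest[todo])

# then read everything that made it to disk and stack it
df.list  <- lapply(dest[file.exists(dest)], read.csv,
                   header = TRUE, sep = "|", as.is = TRUE)
all.data <- do.call(rbind, df.list)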


@MikeTP, if all the reports start with "icecleared_power_" followed by a business date, the timeDate package offers an easy way to create a vector of business dates, like so:

require(timeDate)
tSeq <- timeSequence("2012-01-01","2012-12-31") # vector of days
tBiz <- tSeq[isBizday(tSeq)] # vector of business days

and

paste0("icecleared_power_",as.character.Date(tBiz))

gives you the concatenated file name.
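
To tie that back to your read.csv call, here is a rough sketch that reuses path and the ".dat" extension from your question; try() lets you skip dates for which no file exists (e.g. exchange holidays):

fnames <- paste0("icecleared_power_", format(as.Date(tBiz), "%Y_%m_%d"))
urls   <- paste0(path, fnames, ".dat")

# read each business day's file; failed reads (missing files) come back
# as "try-error" objects and are dropped before stacking
dfs <- lapply(urls, function(u)
          try(read.csv(u, header = TRUE, sep = "|", as.is = TRUE), silent = TRUE))
ok  <- !sapply(dfs, inherits, "try-error")
all.data <- do.call(rbind, dfs[ok])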

If the web site follows a different logic regarding the naming of files, we need more information, as Ananda Mahto observed.

Keep in mind that when you create a date vector with timeDate, you can get much more sophisticated than my simple example. You can take into account holiday schedules, stock exchange calendars, etc.
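
For example, a minimal sketch using the NYSE holiday calendar that ships with timeDate (substitute whichever calendar matches the exchange behind your files):

require(timeDate)
tSeq <- timeSequence("2012-01-01", "2012-12-31")

# business days, excluding NYSE holidays for that year
tBiz <- tSeq[isBizday(tSeq, holidays = holidayNYSE(2012))]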


You can try using the command "download.file".

### set up the path and destination
path <- "url where file is located"
dest <- "where on your hard disk you want the file saved"

### Ask R to try really hard to download your ".csv"
try(download.file(path, dest))

The trick to this is going to be figuring out how the "url" or "path" changes systematically between files. Often, web pages are built such that the URLs are systematic. In that case, you could potentially create a vector or data frame of URLs to iterate over inside of an apply function.

All of this can be sandwiched inside of an "lapply". The "data" object is simply whatever we are iterating over. It could be a vector of URLs or a data frame of year and month observations, which could then be used to create URLs within the "lapply" function (a concrete sketch follows the snippet below).

### "dl" will apply a function to every element in our vector "data"
  # It will also help keep track of files which have no download data
dl <- lapply(data, function(x) {
        path <- 'url'
        dest <- './data_intermediate/...'
        try(download.file(path, dest))
      })
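
As a purely hypothetical illustration, if "data" were a data frame of year/month combinations and the server used a predictable naming pattern (example.com stands in for the real host), the URL could be built inside the function:

# hypothetical: one file per year/month combination
data <- expand.grid(year = 2012:2013, month = 1:12)

dl <- lapply(seq_len(nrow(data)), function(i) {
        path <- sprintf("https://example.com/reports/report_%d_%02d.csv",
                        data$year[i], data$month[i])
        dest <- sprintf("./data_intermediate/report_%d_%02d.csv",
                        data$year[i], data$month[i])
        try(download.file(path, dest))
      })

In a toy case like this you would name the resulting list by year and month rather than by data$name.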

### Assign element names to your list "dl"
names(dl) <- unique(data$name)

### Figure out which downloads failed; try() returns a "try-error"
  # object (not NULL) when download.file() errors out
index       <- sapply(dl, inherits, "try-error")
no.download <- names(dl)[index]

You can then use "list.files()" to merge all the data together, assuming the files belong in one data.frame:

### Create a list of files you want to merge together
files <- list.files()

### Create a list of data.frames by reading each file into memory
data  <- lapply(files, read.csv)

### Stack data together
data <- do.call(rbind, data)

Sometimes, you will notice that a file has been corrupted after downloading. In this case, pay attention to the "mode" option of the download.file() command: set mode = "wb" if the file is stored in a binary format, since the default mode = "w" can corrupt binary downloads (particularly on Windows).
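
For example, reusing the path and dest placeholders from above:

# explicitly request a binary-safe transfer
try(download.file(path, dest, mode = "wb"))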