Read fixed width text file

Solution 1:

This is a fixed width file. Use read.fwf() to read it:

x <- read.fwf(
  file=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"),
  skip=4,
  widths=c(12, 7, 4, 9, 4, 9, 4, 9, 4))

head(x)

            V1   V2   V3   V4   V5   V6   V7   V8  V9
1  03JAN1990   23.4 -0.4 25.1 -0.3 26.6  0.0 28.6 0.3
2  10JAN1990   23.4 -0.8 25.2 -0.3 26.6  0.1 28.6 0.3
3  17JAN1990   24.2 -0.3 25.3 -0.3 26.5 -0.1 28.6 0.3
4  24JAN1990   24.4 -0.5 25.5 -0.4 26.5 -0.1 28.4 0.2
5  31JAN1990   25.1 -0.2 25.8 -0.2 26.7  0.1 28.4 0.2
6  07FEB1990   25.8  0.2 26.1 -0.1 26.8  0.1 28.4 0.3
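
To try the same widths without network access, here is a minimal sketch that reads a local temporary file instead of the URL. The two sample lines are reconstructed from the head(x) output above, so the exact raw layout is an assumption. The column names are made up for illustration and are not part of the original answer. strip.white=TRUE is passed through to read.table() so the padded date field comes back trimmed.

```r
# Two sample records, reconstructed from the head(x) output (assumed raw layout).
sample_lines <- c(
  " 03JAN1990     23.4-0.4     25.1-0.3     26.6 0.0     28.6 0.3",
  " 10JAN1990     23.4-0.8     25.2-0.3     26.6 0.1     28.6 0.3"
)

# Write the sample to a temporary file so read.fwf() has something to read.
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

x <- read.fwf(
  file = tmp,
  widths = c(12, 7, 4, 9, 4, 9, 4, 9, 4),
  # Hypothetical column names, chosen only for readability.
  col.names = c("week", "nino12_sst", "nino12_ssta", "nino3_sst", "nino3_ssta",
                "nino34_sst", "nino34_ssta", "nino4_sst", "nino4_ssta"),
  strip.white = TRUE  # passed on to read.table(); trims the padded date field
)

str(x)
```

No skip argument is needed here because the sample has no header lines; with the real file you would keep skip=4.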

Update

The readr package (released in April 2015) provides a simple and fast alternative.

library(readr)

x <- read_fwf(
  file="http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for",
  skip=4,
  col_positions=fwf_widths(c(12, 7, 4, 9, 4, 9, 4, 9, 4)))

Speed comparison: readr::read_fwf() was ~2x faster than utils::read.fwf().
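
The exact factor will depend on file size and machine, so it is worth measuring on your own data. Below is a rough sketch of one way to do that. It assumes a synthetic file built by repeating a single sample line (reconstructed from the head(x) output, so the raw layout is an assumption) 5,000 times. The readr part only runs if the package is installed, and the column types are given explicitly so readr does not have to guess them.

```r
# One sample record, repeated to get a file large enough to time.
sample_line <- " 03JAN1990     23.4-0.4     25.1-0.3     26.6 0.0     28.6 0.3"
tmp <- tempfile(fileext = ".txt")
writeLines(rep(sample_line, 5000), tmp)

widths <- c(12, 7, 4, 9, 4, 9, 4, 9, 4)

# Time the base R reader.
base_time <- system.time(base_df <- read.fwf(tmp, widths = widths))
print(base_time["elapsed"])

# Time readr on the same file, if it is available.
if (requireNamespace("readr", quietly = TRUE)) {
  readr_time <- system.time(
    readr_df <- readr::read_fwf(
      tmp,
      col_positions = readr::fwf_widths(widths),
      col_types = "cdddddddd"  # one character column, eight doubles
    )
  )
  print(readr_time["elapsed"])
}
```

A single system.time() call is noisy, so treat the numbers as a ballpark and repeat the run a few times before drawing conclusions.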

Solution 2:

Another way to specify the widths is to use negative values for the padding you want to skip:

df <- read.fwf(
  file=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"),
  widths=c(-1, 9, -5, 4, 4, -5, 4, 4, -5, 4, 4, -5, 4, 4),
  skip=4
)

A negative value in the widths argument marks a run of characters to ignore: the -1 skips the one-character column at the start of each line, and each -5 skips the five-character gap before the next pair of columns.

ref : https://www.inkling.com/read/r-cookbook-paul-teetor-1st/chapter-4/recipe-4-6
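
A quick way to sanity-check a widths vector like this is to confirm that the kept and skipped widths together add up to the length of a line. The sketch below does that on a single sample line (reconstructed from the output shown in Solution 1, so the raw layout is an assumption) and then reads it from a temporary file.

```r
# One sample record (assumed raw layout, 62 characters wide).
sample_line <- " 03JAN1990     23.4-0.4     25.1-0.3     26.6 0.0     28.6 0.3"
widths <- c(-1, 9, -5, 4, 4, -5, 4, 4, -5, 4, 4, -5, 4, 4)

# Kept (positive) and skipped (negative) widths should cover the whole line.
stopifnot(sum(abs(widths)) == nchar(sample_line))

tmp <- tempfile(fileext = ".txt")
writeLines(sample_line, tmp)
df <- read.fwf(tmp, widths = widths)

ncol(df)  # only the positive widths become columns
df$V1     # the date field, read without the leading space
```

Because the padding is skipped rather than absorbed into a neighbouring column, the date comes back without surrounding blanks.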

Solution 3:

First off, that question is taken directly from the Coursera "Getting and Cleaning Data" course by Jeff Leek. While there is another part to the question, the tough part is reading the file.

That said, the course is mostly intended for learning.

I hate R's fixed-width procedure. It is slow, and for a large number of variables it very quickly becomes a pain to negate certain columns, etc.

I think it's easier to use readLines() and then use substr() to build your variables:

x <- readLines(con=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"))

# Skip 4 lines
x <- x[-(1:4)]

mydata <- data.frame(var1 = substr(x, 1, 10),
                     var2 = substr(x, 16, 19),
                     var3 = substr(x, 20, 23),
                     var4 = substr(x, 29, 32)  # and so on and so on
                     )
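
One thing to keep in mind with this approach is that substr() always returns character vectors, so the measurement columns still need converting. Here is a minimal sketch of how that could look, using a small table of start/end positions instead of one substr() call per variable. The sample lines are reconstructed from the output shown in Solution 1 (assumed raw layout), and the field names and the `fields` table are hypothetical helpers, not part of the original answer.

```r
# Two sample records standing in for the result of readLines() after skipping the header.
x <- c(
  " 03JAN1990     23.4-0.4     25.1-0.3     26.6 0.0     28.6 0.3",
  " 10JAN1990     23.4-0.8     25.2-0.3     26.6 0.1     28.6 0.3"
)

# Hypothetical lookup table: one row per field, with start and end character positions.
fields <- data.frame(
  name  = c("week", "nino12_sst", "nino12_ssta", "nino3_sst"),
  start = c(2, 16, 20, 29),
  end   = c(10, 19, 23, 32),
  stringsAsFactors = FALSE
)

# Cut every field out of every line and trim any padding.
cols <- Map(function(s, e) trimws(substr(x, s, e)), fields$start, fields$end)
names(cols) <- fields$name
mydata <- as.data.frame(cols, stringsAsFactors = FALSE)

# Convert everything except the date column to numeric.
mydata[-1] <- lapply(mydata[-1], as.numeric)

str(mydata)
```

Adding a variable then only means adding a row to the positions table.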

Solution 4:

You can now use the read_fwf() function in Hadley Wickham's readr package.

  • Announcement: http://blog.rstudio.org/2015/04/09/readr-0-1-0/
  • Development page: https://github.com/hadley/readr
  • CRAN page: http://cran.r-project.org/web/packages/readr/index.html

Expect a substantial performance improvement compared to base read.fwf().
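
If you only need some of the fields, readr also lets you describe columns by start and end position with fwf_positions(). The sketch below is one way that might look, assuming readr is installed. The sample lines are reconstructed from the output shown in Solution 1 (so the raw layout is an assumption), and the column names are invented for illustration.

```r
# Two sample records written to a temporary file (assumed raw layout).
sample_lines <- c(
  " 03JAN1990     23.4-0.4     25.1-0.3     26.6 0.0     28.6 0.3",
  " 10JAN1990     23.4-0.8     25.2-0.3     26.6 0.1     28.6 0.3"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

if (requireNamespace("readr", quietly = TRUE)) {
  # Read only the date and the first pair of measurements.
  positions <- readr::fwf_positions(
    start = c(2, 16, 20),
    end   = c(10, 19, 23),
    col_names = c("week", "nino12_sst", "nino12_ssta")  # hypothetical names
  )
  x <- readr::read_fwf(tmp, col_positions = positions, col_types = "cdd")
  print(x)
}
```

With the real file you would also pass skip=4 to step over the header lines.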