Read observations in fixed width files spanning multiple lines in R

I am trying to read a dat file in R. The data is available here.

The specification about the dataset is available here.

The first part of my problem is solved by using the sep option and providing information about where every column starts and providing the na.strings="*". However, I do not know how to handle a single observation spanning more than 1 line.

In this data set, all the observations span 2 lines.

Solution 1:

You actually need read.fwf for this.

Set up some sample data

    txt <- 'Acura         Integra        Small   12.9 15.9 18.8 25 31 0 1 4 1.8 140 6300
2890 1 13.2 5 177 102 68 37 26.5 11 2705 0
Acura         Legend         Midsize 29.2 33.9 38.7 18 25 2 1 6 3.2 200 5500
2335 1 18.0 5 195 115 71 38 30.0 15 3560 0
Audi          90             Compact 25.9 29.1 32.3 20 26 1 1 6 2.8 172 5500
2280 1 16.9 5 180 102 67 37 28.0 14 3375 0'

Read using read.fwf - pay attention to widths argument. The widths should be a list of 2 integer vectors specifying element widths on multiple lines

DF <- read.fwf(textConnection(txt), 
               widths = list(
                 c(14, 15, 8, 5, 5, 5, 3, 3, 2, 2, 2, 4, 4, 4), 
                 c(5, 2, 5, 2, 4, 4, 3, 3, 5, 3, 5, 1)
               ), 
               header = FALSE)

Using pander package to pretty print the table since it has so many columns.

require(pander)
pandoc.table(DF)
## 
## ---------------------------------------------------
##  V1     V2      V3     V4   V5   V6   V7   V8   V9 
## ----- ------- ------- ---- ---- ---- ---- ---- ----
## Acura Integra  Small  12.9 15.9 18.8  25   31   0  
## 
## Acura Legend  Midsize 29.2 33.9 38.7  18   25   2  
## 
## Audi    90    Compact 25.9 29.1 32.3  20   26   1  
## ---------------------------------------------------
## 
## Table: Table continues below
## 
##  
## -----------------------------------------------
##  V10   V11   V12   V13   V14   V15   V16   V17 
## ----- ----- ----- ----- ----- ----- ----- -----
##   1     4    1.8   140  6300  2890    1   13.2 
## 
##   1     6    3.2   200  5500  2335    1   18.0 
## 
##   1     6    2.8   172  5500  2280    1   16.9 
## -----------------------------------------------
## 
## Table: Table continues below
## 
##  
## -----------------------------------------------
##  V18   V19   V20   V21   V22   V23   V24   V25 
## ----- ----- ----- ----- ----- ----- ----- -----
##   5    177   102   68    37   26.5   11   2705 
## 
##   5    195   115   71    38   30.0   15   3560 
## 
##   5    180   102   67    37   28.0   14   3375 
## -----------------------------------------------
## 
## Table: Table continues below
## 
##  
## -----
##  V26 
## -----
##   0  
## 
##   0  
## 
##   0  
## -----
##

Solution 2:

Here's a workaround:

link <- "http://www.amstat.org/publications/jse/datasets/93cars.dat.txt"

Read the raw lines:

rawlines <- readLines(file(link))

Convert to one text string:

lines <- paste(rawlines[c(TRUE, FALSE)], rawlines[c(FALSE, TRUE)], collapse = "\n")

The function paste is used to combine multiple strings. rawlines[c(TRUE, FALSE)] represents the odd lines, rawlines[c(FALSE, TRUE)] represents the even lines. (For details on how even and odd elements can be selected with boolean values, see this answer.) Both small lines are combined to one long line. Then, all long lines are combined into one single string with the argument collapse = "\n". The lines are separated by line breaks (\n).

Read the new text with read.table:

dat <- read.table(text = lines, na.string = "*")

The result:

> head(dat)
     V1      V2      V3   V4   V5   V6 V7 V8 V9 V10 V11 V12 V13  V14  V15 V16  V17
1 Acura Integra   Small 12.9 15.9 18.8 25 31  0   1   4 1.8 140 6300 2890   1 13.2
2 Acura  Legend Midsize 29.2 33.9 38.7 18 25  2   1   6 3.2 200 5500 2335   1 18.0
3  Audi      90 Compact 25.9 29.1 32.3 20 26  1   1   6 2.8 172 5500 2280   1 16.9
4  Audi     100 Midsize 30.8 37.7 44.6 19 26  2   1   6 2.8 172 5500 2535   1 21.1
5   BMW    535i Midsize 23.7 30.0 36.2 22 30  1   0   4 3.5 208 5700 2545   1 21.1
6 Buick Century Midsize 14.2 15.7 17.3 22 31  1   1   4 2.2 110 5200 2565   0 16.4
  V18 V19 V20 V21 V22  V23 V24  V25 V26
1   5 177 102  68  37 26.5  11 2705   0
2   5 195 115  71  38 30.0  15 3560   0
3   5 180 102  67  37 28.0  14 3375   0
4   6 193 106  70  37 31.0  17 3405   0
5   4 186 109  69  39 27.0  13 3640   0
6   6 189 105  69  41 28.0  16 2880   1

Read observations in fixed width files spanning multiple lines in R

Solution 1:

Solution 2:

Related

Recent Posts