How to read data when some numbers contain commas as thousands separators?

Solution 1:

I'm not sure how to make read.csv interpret these values directly, but you can use gsub to replace "," with "" and then convert the resulting strings with as.numeric:

y <- c("1,200","20,000","100","12,111")
as.numeric(gsub(",", "", y))
# [1]  1200 20000   100 12111

This was also answered previously on R-Help (and in Q2 here).

Alternatively, you can pre-process the file, for instance with sed on Unix.

Solution 2:

You can have read.table or read.csv do this conversion for you semi-automatically. First create a new class definition, then create a conversion function and set it as an "as" method using the setAs function like so:

setClass("num.with.commas")
setAs("character", "num.with.commas", 
        function(from) as.numeric(gsub(",", "", from) ) )

Then run read.csv like:

DF <- read.csv('your.file.here', 
   colClasses=c('num.with.commas','factor','character','numeric','num.with.commas'))
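To make the approach above concrete, here is a minimal self-contained sketch that runs without a file on disk. The inline CSV text and the column names id and amount are made up for illustration; the class definition and the setAs call are the ones from this answer.

```r
library(methods)  # setClass/setAs live here; usually attached already

# Virtual class used only as a label for colClasses
setClass("num.with.commas")
setAs("character", "num.with.commas",
      function(from) as.numeric(gsub(",", "", from)))

# Hypothetical data: the fields with commas are quoted, as in a typical export
csv_text <- 'id,amount\na,"1,200"\nb,"20,000"\nc,100\n'

DF <- read.csv(text = csv_text,
               colClasses = c("character", "num.with.commas"))

str(DF)     # amount should come back as a numeric column
DF$amount
```

Note that the values with commas have to be quoted in the file; otherwise read.csv splits them into separate fields before the conversion function ever sees them.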

Solution 3:

I want to use R rather than pre-processing the data as it makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:

x <- read.csv("file.csv",header=TRUE,colClasses="character")
col2cvt <- 15:41
# Strip the commas from each selected column, then convert it to numeric
x[,col2cvt] <- lapply(x[,col2cvt], function(col) as.numeric(gsub(",", "", col)))
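The same idea on a small in-memory example, as a sketch. The CSV text and the column names name, q1 and q2 are invented for illustration, and the columns are selected by name rather than by position.

```r
# Hypothetical data standing in for file.csv
csv_text <- 'name,q1,q2\nalpha,"1,200","12,111"\nbeta,100,"20,000"\n'

# Read everything as character so nothing is converted prematurely
x <- read.csv(text = csv_text, header = TRUE, colClasses = "character")

# Columns that hold numbers with thousands separators
col2cvt <- c("q1", "q2")
x[col2cvt] <- lapply(x[col2cvt],
                     function(col) as.numeric(gsub(",", "", col)))

sapply(x, class)  # name stays character, q1 and q2 become numeric
```

Selecting by name is a little more robust than 15:41 if columns are added or reordered when the data are revised.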

Solution 4:

This question is several years old, but I stumbled upon it, which means maybe others will.

The readr package has some nice features. One of them is a convenient way to interpret "messy" columns like these.

library(readr)
# Recent readr versions name this collector col_number();
# older releases used col_numeric().
read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5",
          col_types = list(col_number())
        )

This yields

Source: local data frame [4 x 1]

  numbers
    (dbl)
1   800.0
2  1800.0
3  3500.0
4     6.5

An important point when reading in files: you either have to pre-process, as in the sed suggestion above, or you have to process while reading. If you try to fix things after the fact, you often end up relying on dangerous assumptions that are hard to spot. (Which is why flat files are so evil in the first place.)

For instance, if I had not flagged the col_types, I would have gotten this:

> read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5")
Source: local data frame [4 x 1]

  numbers
    (chr)
1     800
2   1,800
3    3500
4     6.5

(Notice that the column is now chr (character) instead of numeric.)

Or, more dangerously, if it were long enough and most of the early elements did not contain commas:

> set.seed(1)
> tmp <- as.character(sample(c(1:10), 100, replace=TRUE))
> tmp <- c(tmp, "1,003")
> tmp <- paste(tmp, collapse="\"\n\"")

(such that the last few elements look like:)

\"5\"\n\"9\"\n\"7\"\n\"1,003"

Then you'll have trouble reading that comma at all: the last value comes back as 1.003 rather than 1003!

> tail(read_csv(tmp))
Source: local data frame [6 x 1]

     3"
  (dbl)
1 8.000
2 5.000
3 5.000
4 9.000
5 7.000
6 1.003
Warning message:
1 problems parsing literal data. See problems(...) for more details.
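For contrast, here is a base-R sketch of the "process while reading" route on similar data: read the column as character, check that every entry looks like a number with optional thousands separators, and only then convert. The values are made up (not the sampled ones above), and the regular expression is just one assumption about what a valid entry looks like.

```r
# 100 small integers followed by one value with a thousands separator
vals <- c(as.character(rep(1:10, 10)), "1,003")

# One quoted field per line, no header
df <- read.csv(text = paste0('"', vals, '"'),
               header = FALSE, colClasses = "character")

# Accept either plain numbers or digits grouped in threes by commas
pattern <- "^-?[0-9]{1,3}(,[0-9]{3})*(\\.[0-9]+)?$|^-?[0-9]+(\\.[0-9]+)?$"
stopifnot(all(grepl(pattern, df$V1)))

nums <- as.numeric(gsub(",", "", df$V1))
tail(nums)  # the last element is 1003, not 1.003
```

Because the type is stated up front and the entries are validated, nothing depends on what the first rows happen to look like.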