Converting string to numeric [duplicate]
I've imported a test file and tried to make a histogram:
pichman <- read.csv(file="picman.txt", header=TRUE, sep="/t")
hist <- as.numeric(pichman$WS)
However, I get numbers that differ from the values in my dataset. Originally I thought this was because the column contained text, so I removed the text entries:
table(pichman$WS)
ws <- pichman$WS[pichman$WS!="Down" & pichman$WS!="NoData"]
However, I am still getting very high numbers. Does anyone have an idea?
I suspect you are having a problem with factors. For example,
> x = factor(4:8)
> x
[1] 4 5 6 7 8
Levels: 4 5 6 7 8
> as.numeric(x)
[1] 1 2 3 4 5
> as.numeric(as.character(x))
[1] 4 5 6 7 8
Some comments:
- You mention that your vector contains the characters "Down" and "NoData". What do you expect/want as.numeric to do with these values?
- In read.csv, try using the argument stringsAsFactors=FALSE.
- Are you sure it's sep="/t" and not sep="\t"?
- Use the command head(pichman) to check the first few rows of your data.
- Also, it's very tricky to guess what your problem is when you don't provide data. A minimal working example is always preferable. For example, I can't run the command
  pichman <- read.csv(file="picman.txt", header=TRUE, sep="/t")
  since I don't have access to the data set.
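Putting the comments above together, here is a minimal sketch of an import that avoids factors altogether. Since the real picman.txt is not available, the file contents below are invented for illustration (only the WS column name and the "Down"/"NoData" markers come from the question; the ID column is hypothetical). It also assumes you want those markers treated as missing values, which na.strings does at read time.

```r
# Stand-in for picman.txt: tab-separated text with a header row.
# The ID column and all values are made up for this example.
txt <- "ID\tWS\n1\t12.5\n2\tDown\n3\t3.1\n4\tNoData\n"

pichman <- read.csv(
  text = txt,                        # use file="picman.txt" for the real file
  header = TRUE,
  sep = "\t",                        # backslash-t is a tab; "/t" is not
  stringsAsFactors = FALSE,          # keep text columns as character
  na.strings = c("Down", "NoData")   # read these markers as NA
)

head(pichman)                        # check the first few rows
str(pichman$WS)                      # WS should now be numeric

# Drop the missing values before plotting, e.g. with hist(ws)
ws <- pichman$WS[!is.na(pichman$WS)]
```

With the markers mapped to NA, read.csv can parse WS as numeric directly, so no as.numeric conversion is needed afterwards.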
As csgillespie said, stringsAsFactors defaults to TRUE (in R versions before 4.0.0), which converts any text column to a factor. So even after deleting the text entries, you still have a factor in your data frame.
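A small sketch of that point, using made-up values in place of the real WS column: removing the "Down" and "NoData" entries by subsetting leaves the object a factor with all of its original levels, so as.numeric still returns the internal level codes rather than the values.

```r
# Hypothetical stand-in for pichman$WS as read with stringsAsFactors=TRUE
ws_raw <- factor(c("12.5", "Down", "3.1", "NoData", "7"))

# Same filtering as in the question
kept <- ws_raw[ws_raw != "Down" & ws_raw != "NoData"]

is.factor(kept)    # still a factor after subsetting
levels(kept)       # "Down" and "NoData" remain as (unused) levels

as.numeric(kept)                 # integer level codes, not the data
as.numeric(as.character(kept))   # the actual values
```

The codes depend on how many distinct levels the column has, which is why the numbers can look nothing like the data.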
Now regarding the conversion, there's a more efficient way to do it, so I'll put it here for reference:
> x <- factor(sample(4:8,10,replace=T))
> x
[1] 6 4 8 6 7 6 8 5 8 4
Levels: 4 5 6 7 8
> as.numeric(levels(x))[x]
[1] 6 4 8 6 7 6 8 5 8 4
The output above shows that it works.
The timings:
> x <- factor(sample(4:8,500000,replace=T))
> system.time(as.numeric(as.character(x)))
   user  system elapsed
   0.11    0.00    0.11
> system.time(as.numeric(levels(x))[x])
   user  system elapsed
      0       0       0
It's a big improvement, although the conversion is not always the bottleneck. It does become important, however, if you have a big data frame and many columns to convert.
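For the many-columns case, one possible sketch is to wrap the levels-based conversion in a helper and apply it to every factor column. The data frame and the helper name factor_to_numeric below are hypothetical, and the sketch assumes every factor column holds numeric text (any level that is not a number would become NA with a warning).

```r
# Hypothetical data frame: two factor columns of numeric text, one character column
df <- data.frame(
  a = factor(c("4", "8", "6")),
  b = factor(c("1.5", "2.5", "1.5")),
  label = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# Convert the (few) levels once, then index them by the factor's codes
factor_to_numeric <- function(f) as.numeric(levels(f))[f]

# Find the factor columns and convert only those
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], factor_to_numeric)

str(df)
```

The speed-up comes from calling as.numeric on the distinct levels only, instead of on every element of the column.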