Splitting a dataframe string column into multiple different columns [duplicate]
What I am trying to accomplish is splitting a column into multiple columns. I would prefer the first column to contain "F", second column "US", third "CA6" or "DL", and the fourth to be "Z13" or "U13" etc etc. My entire df follows the same pattern of X.XX.XXXX.XXX or X.XX.XXX.XXX or X.XX.XX.XXX and I know the third column is where my problem lies because of the different lengths. I have only used substr in the past and I could use that here with some if statements but would like to learn how to use stringr package and POSIX to do this (unless there is a better option). Thank you in advance.
Here is my df:
c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13",
"F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13",
"F.US.DL.U13", "F.US.DL.U13", "F.US.DL.U13", "F.US.DL.Z13", "F.US.DL.Z13"
)
Solution 1:
A very direct way is to just use read.table
on your character vector:
> read.table(text = text, sep = ".", colClasses = "character")
V1 V2 V3 V4
1 F US CLE V13
2 F US CA6 U13
3 F US CA6 U13
4 F US CA6 U13
5 F US CA6 U13
6 F US CA6 U13
7 F US CA6 U13
8 F US CA6 U13
9 F US DL U13
10 F US DL U13
11 F US DL U13
12 F US DL Z13
13 F US DL Z13
colClasses
needs to be specified, otherwise F
gets converted to FALSE
(which is something I need to fix in "splitstackshape", otherwise I would have recommended that :) )
Update (> a year later)...
Alternatively, you can use my cSplit
function, like this:
cSplit(as.data.table(text), "text", ".")
# text_1 text_2 text_3 text_4
# 1: F US CLE V13
# 2: F US CA6 U13
# 3: F US CA6 U13
# 4: F US CA6 U13
# 5: F US CA6 U13
# 6: F US CA6 U13
# 7: F US CA6 U13
# 8: F US CA6 U13
# 9: F US DL U13
# 10: F US DL U13
# 11: F US DL U13
# 12: F US DL Z13
# 13: F US DL Z13
Or, separate
from "tidyr", like this:
library(dplyr)
library(tidyr)
as.data.frame(text) %>% separate(text, into = paste("V", 1:4, sep = "_"))
# V_1 V_2 V_3 V_4
# 1 F US CLE V13
# 2 F US CA6 U13
# 3 F US CA6 U13
# 4 F US CA6 U13
# 5 F US CA6 U13
# 6 F US CA6 U13
# 7 F US CA6 U13
# 8 F US CA6 U13
# 9 F US DL U13
# 10 F US DL U13
# 11 F US DL U13
# 12 F US DL Z13
# 13 F US DL Z13
Solution 2:
Is this what you are trying to do?
# Our data
text <- c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13",
"F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13",
"F.US.DL.U13", "F.US.DL.U13", "F.US.DL.U13", "F.US.DL.Z13", "F.US.DL.Z13"
)
# Split into individual elements by the '.' character
# Remember to escape it, because '.' by itself matches any single character
elems <- unlist( strsplit( text , "\\." ) )
# We know the dataframe should have 4 columns, so make a matrix
m <- matrix( elems , ncol = 4 , byrow = TRUE )
# Coerce to data.frame - head() is just to illustrate the top portion
head( as.data.frame( m ) )
# V1 V2 V3 V4
#1 F US CLE V13
#2 F US CA6 U13
#3 F US CA6 U13
#4 F US CA6 U13
#5 F US CA6 U13
#6 F US CA6 U13
Solution 3:
The way via unlist
and matrix
seems a bit convoluted, and requires you to hard-code the number of elements (this is actually a pretty big no-go. Of course you could circumvent hard-coding that number and determine it at run-time)
I would go a different route, and construct a data frame directly from the list that strsplit
returns. For me, this is conceptually simpler. There are essentially two ways of doing this:
-
as.data.frame
– but since the list is exactly the wrong way round (we have a list of rows rather than a list of columns) we have to transpose the result. We also clear therownames
since they are ugly by default (but that’s strictly unnecessary!):`rownames<-`(t(as.data.frame(strsplit(text, '\\.'))), NULL)
-
Alternatively, use
rbind
to construct a data frame from the list of rows. We usedo.call
to callrbind
with all the rows as separate arguments:do.call(rbind, strsplit(text, '\\.'))
Both ways yield the same result:
[,1] [,2] [,3] [,4]
[1,] "F" "US" "CLE" "V13"
[2,] "F" "US" "CA6" "U13"
[3,] "F" "US" "CA6" "U13"
[4,] "F" "US" "CA6" "U13"
[5,] "F" "US" "CA6" "U13"
[6,] "F" "US" "CA6" "U13"
…
Clearly, the second way is much simpler than the first.