split character data into numbers and letters
I have a vector of character data. Most of the elements in the vector consist of one or more letters followed by one or more numbers. I wish to split each element in the vector into the character portion and the number portion. I found a similar question on Stack Overflow here:
split a character from a number with multiple digits
However, the answer given above does not seem to work completely in my case or I am doing something wrong. An example vector is below:
my.data <- c("aaa", "b11", "b21", "b101", "b111", "ccc1", "ddd1", "ccc20", "ddd13")
# I can obtain the number portion using:
gsub("[^[:digit:]]", "", my.data)
# However, I cannot obtain the character portion using:
gsub("[:digit:]", "", my.data)
How can I obtain the character portion? I am using R version 2.14.1 on a Windows 7 64-bit machine.
Solution 1:
For your regex you have to use:
gsub("[[:digit:]]","",my.data)
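The [:digit:] character class only makes sense inside a set of []. On its own, "[:digit:]" is read as an ordinary set containing the characters :, d, i, g and t, which is why your second call does not remove the digits.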
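A minimal sketch of the difference, using the vector from the question. The variable names wrong, letters.part and digits.part are only illustrative, and the behaviour shown for the unbracketed pattern assumes R's default regex engine (no perl = TRUE).

```r
my.data <- c("aaa", "b11", "b21", "b101", "b111", "ccc1", "ddd1", "ccc20", "ddd13")

# Without the outer brackets, "[:digit:]" is an ordinary bracket expression
# listing the characters ':', 'd', 'i', 'g' and 't', so those get removed.
wrong <- gsub("[:digit:]", "", my.data)

# With the outer brackets, [[:digit:]] matches any digit, leaving the letters.
letters.part <- gsub("[[:digit:]]", "", my.data)

# Negating the class removes everything that is not a digit.
digits.part <- gsub("[^[:digit:]]", "", my.data)

print(wrong)
print(letters.part)
print(digits.part)
```

Here "ddd13" should come back as "13" in wrong (the d's are stripped, the digits are kept), while letters.part holds "ddd" and digits.part holds "13" for the same element.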
Solution 2:
Since none of the previous answers use tidyr::separate, here it goes:
library(tidyr)
df <- data.frame(mycol = c("APPLE348744", "BANANA77845", "OATS2647892", "EGG98586456"))
df %>%
  separate(mycol,
           into = c("text", "num"),
           sep = "(?<=[A-Za-z])(?=[0-9])"
  )
Solution 3:
With stringr, if you like (and slightly different from the answer to the other question):
# load library
library(stringr)
#
# load data
my.data <- c("aaa", "b11", "b21", "b101", "b111", "ccc1", "ddd1", "ccc20", "ddd13")
#
# extract numbers only
my.data.num <- as.numeric(str_extract(my.data, "[0-9]+"))
#
# check output
my.data.num
[1] NA 11 21 101 111 1 1 20 13
#
# extract letters only
# (use [A-Za-z]; a range like [aA-zZ] also matches the punctuation between "Z" and "a" in ASCII)
my.data.cha <- str_extract(my.data, "[A-Za-z]+")
#
# check output
my.data.cha
[1] "aaa" "b" "b" "b" "b" "ccc" "ddd" "ccc" "ddd"
Solution 4:
Late answer, but another option is to use strsplit with a regex pattern which uses lookarounds to find the boundary between numbers and letters:
var <- "ABC123"
strsplit(var, "(?=[A-Za-z])(?<=[0-9])|(?=[0-9])(?<=[A-Za-z])", perl=TRUE)
[[1]]
[1] "ABC" "123"
The above pattern will match (but not consume) when either the previous character is a letter and the following character is a number, or vice-versa. Note that we use strsplit in Perl mode to access lookarounds.
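As a rough sketch of how this could be applied to the whole vector from the question rather than a single string: strsplit returns a list with one element per input string, so the pieces have to be pulled out afterwards. The names boundary, parts, letters.part and digits.part are illustrative, and the sketch assumes each element has at most one letter block followed by at most one digit block.

```r
my.data <- c("aaa", "b11", "b21", "b101", "b111", "ccc1", "ddd1", "ccc20", "ddd13")

# Zero-width boundary between a letter and a digit (in either order).
boundary <- "(?=[A-Za-z])(?<=[0-9])|(?=[0-9])(?<=[A-Za-z])"

# One list element per input string, each a character vector of pieces.
parts <- strsplit(my.data, boundary, perl = TRUE)

# First piece is the letter portion; second piece (if any) is the number portion.
# Indexing past the end of a piece vector yields NA, e.g. for "aaa".
letters.part <- vapply(parts, function(p) p[1], character(1))
digits.part <- vapply(parts, function(p) p[2], character(1))

print(letters.part)
print(digits.part)
```

Elements without a number portion, such as "aaa", end up with NA in digits.part, which carries over cleanly if the result is later passed through as.numeric.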
Solution 5:
A slightly more elegant way (without any external packages):
> x = c("aaa", "b11", "b21", "b101", "b111", "ccc1", "ddd1", "ccc20", "ddd13")
> gsub('\\D','', x) # removes non-digits (replaces them with empty strings)
[1] "" "11" "21" "101" "111" "1" "1" "20" "13"
> gsub('\\d','', x) # removes digits (replaces them with empty strings)
[1] "aaa" "b" "b" "b" "b" "ccc" "ddd" "ccc" "ddd"
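If the two portions are wanted side by side, the two calls can be combined into a data frame. This is only a sketch: the name result and the column names letters and number are illustrative, and converting the digit strings with as.numeric is an assumption about what the numbers will be used for.

```r
x <- c("aaa", "b11", "b21", "b101", "b111", "ccc1", "ddd1", "ccc20", "ddd13")

result <- data.frame(
  letters = gsub("\\d", "", x),            # strip digits, keep letters
  number = as.numeric(gsub("\\D", "", x)), # strip non-digits, convert to numeric
  stringsAsFactors = FALSE                 # keep letters as character in old R versions
)

print(result)
```

Elements with no digits give an empty string after the first gsub, which as.numeric turns into NA.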