Extract numeric part of strings of mixed numbers and characters in R

I have a lot of strings, and each of which tends to have the following format: Ab_Cd-001234.txt I want to replace it with 001234. How can I achieve it in R?


The stringr package has lots of handy shortcuts for this kind of work:

# input data following @agstudy
data <-  c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')

# load library
library(stringr)

# prepare regular expression
regexp <- "[[:digit:]]+"

# process string
str_extract(data, regexp)

Which gives the desired result:

  [1] "001234" "001234"

To explain the regexp a little:

[[:digit:]] is any number 0 to 9

+ means the preceding item (in this case, a digit) will be matched one or more times

This page is also very useful for this kind of string processing: http://en.wikibooks.org/wiki/R_Programming/Text_Processing


Using gsub or sub you can do this :

 gsub('.*-([0-9]+).*','\\1','Ab_Cd-001234.txt')
"001234"

you can use regexpr with regmatches

m <- gregexpr('[0-9]+','Ab_Cd-001234.txt')
regmatches('Ab_Cd-001234.txt',m)
"001234"

EDIT the 2 methods are vectorized and works for a vector of strings.

x <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
sub('.*-([0-9]+).*','\\1',x)
"001234" "001234"

 m <- gregexpr('[0-9]+',x)
> regmatches(x,m)
[[1]]
[1] "001234"

[[2]]
[1] "001234"