Extract a regular expression match

Solution 1:

Use the new stringr package which wraps all the existing regular expression operates in a consistent syntax and adds a few that are missing:

library(stringr)
str_locate("aaa12xxx", "[0-9]+")
#      start end
# [1,]     4   5
str_extract("aaa12xxx", "[0-9]+")
# [1] "12"

Solution 2:

It is probably a bit hasty to say 'ignore the standard functions' - the help file for ?gsub even specifically references in 'See also':

‘regmatches’ for extracting matched substrings based on the results of ‘regexpr’, ‘gregexpr’ and ‘regexec’.

So this will work, and is fairly simple:

txt <- "aaa12xxx"
regmatches(txt,regexpr("[0-9]+",txt))
#[1] "12"

Solution 3:

For your specific case you could remove all not numbers:

gsub("[^0-9]", "", "aaa12xxxx")
# [1] "12"

It won't work in more complex cases

gsub("[^0-9]", "", "aaa12xxxx34")
# [1] "1234"

Solution 4:

You can use PERL regexs' lazy matching:

> sub(".*?([0-9]+).*", "\\1", "aaa12xx99",perl=TRUE)
[1] "12"

Trying to substitute out non-digits will lead to an error in this case.

Solution 5:

Use capturing parentheses in the regular expression and group references in the replacement. Anything in parentheses gets remembered. Then they're accessed by \2, the first item. The first backslash escapes the backslash's interpretation in R so that it gets passed to the regular expression parser.

gsub('([[:alpha:]]+)([0-9]+)([[:alpha:]]+)', '\\2', "aaa12xxx")