How to prevent regmatches from dropping non-matches?

I would like to capture the first match, and return NA if there is no match.

regexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
# [1]  1 -1  3  1
# attr(,"match.length")
# [1]  1 -1  1  2

x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl=TRUE)
regmatches(x, m)
# [1]  "a"  "a"  "aa"

I expected "a", NA, "a", "aa" instead, with NA marking the string that has no match.

Solution 1:

Staying with regexpr:

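r <- regexpr("a+", x)
# Start from an all-NA character vector, then fill in only the positions that matched
out <- rep(NA_character_, length(x))
out[r != -1] <- regmatches(x, r)
out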
#[1] "a"  NA   "a"  "aa"
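The same idea can be wrapped in a small function so it is reusable. This is only a sketch: the name first_match is made up for illustration, and the extra !is.na() check is an assumption about how you want NA inputs handled (they simply stay NA).

```r
# Hypothetical helper (not part of base R): first match per element, NA where none
first_match <- function(pattern, x, ...) {
  m <- regexpr(pattern, x, ...)
  # regexpr gives -1 for "no match" and NA for NA input; treat both as misses
  hit <- !is.na(m) & m != -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)
  out
}

x <- c("abc", "def", "cba a", "aa")
first_match("a+", x, perl = TRUE)
```

Extra arguments such as perl=TRUE or fixed=TRUE are passed straight through to regexpr.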

Solution 2:

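Use regexec instead. Its result makes regmatches return a list, with a character(0) element for each non-match, so you can replace those with NA before unlisting:

R <- regmatches(x, regexec("a+", x))
# Non-matches come back as character(0); swap them for NA so unlist() keeps their positions
R[lengths(R) == 0] <- NA_character_
unlist(R)

# [1] "a"  NA   "a"  "aa"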
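One caveat worth keeping in mind: regexec also reports capture groups, so if the pattern contains parentheses each matched element holds more than one string and a plain unlist no longer lines up with x. A minimal sketch, using a made-up pattern with two groups purely for illustration, that keeps only the full match:

```r
x <- c("abc", "def", "cba a", "aa")

# Illustrative pattern with two capture groups (not from the question)
R <- regmatches(x, regexec("(a+)(b?)", x))

# Matched elements hold the full match plus one entry per group;
# non-matches are still character(0)
lengths(R)

# Take the first entry (the full match), or NA when there is none
full <- vapply(R, function(r) if (length(r) == 0L) NA_character_ else r[1],
               character(1))
full
```

vapply is used here instead of unlist so the result is guaranteed to have one character value per input string.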

Solution 3:

Since R 3.3.0, regmatches can return both the matched and the non-matched substrings via the invert=NA argument. The help file says:

if invert is NA, regmatches extracts both non-matched and matched substrings, always starting and ending with a non-match (empty if the match occurred at the beginning or the end, respectively).

The output is a list. In the typical case of at most one match per string (as with regexpr), each element has length 3 when a match is found (the text before, the match, the text after) and length 1 when there is no match (the unchanged string).

myMatch <- regmatches(x, m, invert=NA)
myMatch
[[1]]
[1] ""   "a"  "bc"

[[2]]
[1] "def"

[[3]]
[1] "cb" "a"  " a"

[[4]]
[1] ""   "aa" ""

So to extract what you want (with "" in place of NA), you can use sapply as follows:

myVec <- sapply(myMatch, function(x) {if(length(x) == 1) "" else x[2]})
myVec
[1] "a"  ""   "a"  "aa"

At this point, if you really want NA instead of "", you can use

is.na(myVec) <- nchar(myVec) == 0L
myVec
[1] "a"  NA   "a"  "aa"

Some revisions:
Note that you can collapse the last two steps into one by returning NA directly:

myVec <- sapply(myMatch, function(x) {if(length(x) == 1) NA_character_ else x[2]})

The default type of NA is logical, so using it here can force an extra type conversion. Using the character version, NA_character_, avoids this.
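To see where the type difference shows up, here is a small sketch using a second vector, y, invented for illustration, in which nothing matches. With a plain NA the result silently becomes a logical vector:

```r
typeof(NA)             # "logical"
typeof(NA_character_)  # "character"

# Illustrative input where no element matches (requires R >= 3.3.0 for invert = NA)
y <- c("def", "xyz")
parts <- regmatches(y, regexpr("a+", y), invert = NA)

with_na  <- sapply(parts, function(p) if (length(p) == 1) NA else p[2])
with_chr <- sapply(parts, function(p) if (length(p) == 1) NA_character_ else p[2])

typeof(with_na)   # "logical"
typeof(with_chr)  # "character"
```

When at least one string matches, sapply coerces everything to character anyway, so the difference only matters in the all-non-match case.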

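An even slicker extraction method is to call the [ function directly. It works because indexing past the end of a vector returns NA, so the length-1 non-match elements yield NA at position 2: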

sapply(myMatch, `[`, 2)
[1] "a"  NA   "a"  "aa"

So you can do the whole thing in a fairly readable single line:

sapply(regmatches(x, m, invert=NA), `[`, 2)
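The one-liner relies on out-of-bounds indexing, which can be checked directly. The sketch below also swaps sapply for vapply, which is my substitution rather than part of the answer, so that the result is always a character vector:

```r
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl = TRUE)
parts <- regmatches(x, m, invert = NA)  # requires R >= 3.3.0

# Indexing past the end of a character vector gives NA rather than an error
"def"[2]
is.na(parts[[2]][2])

# Same extraction as the sapply call, with the return type fixed to character
res <- vapply(parts, `[`, character(1), 2)
res
```

Note that this only applies to regexpr-style results with at most one match per string; with gregexpr the elements can be longer and position 2 is just the first of several matches.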

Solution 4:

Using more or less the same construction as yours -

chars <- c("abc", "def", "cba a", "aa")    

chars[
   regexpr("a+", chars, perl=TRUE) > 0
][1] #abc

chars[
   regexpr("q", chars, perl=TRUE) > 0
][1]  #NA

#vector[
#    find all indices where regexpr returned positive value i.e., match was found
#][return the first element of the above subset]

Edit - Seems like I misunderstood the question. But since two people have found this useful I shall let it stay.
