POSIX character class does not work in base R regex
I'm having some problems matching a pattern against a string of text in R.
I'm trying to get TRUE with grepl when the text is something like "lettersornumbersorspaces y lettersornumbersorspaces".
I'm using the following regex:
([:alnum:]|[:blank:])+[:blank:][yY][:blank:]([:alnum:]|[:blank:])+
When I use the regex as follows to obtain the "address", it works as expected.
library(stringr)  # str_extract() comes from the stringr package
regex <- "([:alnum:]|[:blank:])+[:blank:][yY][:blank:]([:alnum:]|[:blank:])+"
address <- str_extract(fulltext, regex)
I see that address is the text that I need. Now, if I want to use grepl to get a TRUE as follows:
grepl("([:alnum:]|[:blank:])+[:blank:][yY][:blank:]([:alnum:]|[:blank:])+", address,ignore.case = TRUE)
FALSE is returned. How is this possible? I'm using the same regex to get TRUE. I have tried modifying the grepl parameters, but none of them is related to this.
An example of text is: "26 de Marzo y Pareyra de la Luz"
Thanks!!
Solution 1:
Although the ICU regex engine used by stringr supports bare POSIX character classes in the pattern, in the base R regex flavors (both PCRE (perl=TRUE) and TRE), POSIX character classes must be placed inside bracket expressions: [:alnum:] -> [[:alnum:]].
x <- c("AZaz09 y AZaz09", "ĄŻaz09 y AZŁł09", "26 de Marzo y Pareyra de la Luz")
grepl("[[:alnum:][:blank:]]+[[:blank:]][yY][[:blank:]][[:alnum:][:blank:]]+", x)
## => [1] TRUE TRUE TRUE
grepl("[[:alnum:][:blank:]]+[[:blank:]][yY][[:blank:]][[:alnum:][:blank:]]+", x, perl=TRUE)
## => [1] TRUE TRUE TRUE
When you use [:alnum:] alone, it is a simple bracket expression that matches a single character: either :, a, l, n, u, or m.
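To make the difference concrete, here is a minimal sketch comparing the bare and the wrapped form with the default (TRE) engine. The test strings are made up for illustration, and suppressWarnings() is only there in case your R version warns about a character class used outside brackets.

```r
# Hypothetical test strings, chosen only to show the difference.
words <- c("manual", "xyz", "123")

# Bare form: a bracket expression listing the literal characters
# ':', 'a', 'l', 'n', 'u', 'm', so only "manual" consists of them.
bare <- suppressWarnings(grepl("^[:alnum:]+$", words))
print(bare)
## => [1]  TRUE FALSE FALSE

# Wrapped form: the POSIX class inside a bracket expression,
# which matches any letter or digit.
wrapped <- grepl("^[[:alnum:]]+$", words)
print(wrapped)
## => [1] TRUE TRUE TRUE
```

This is also why the original grepl call returns FALSE for the sample address: no y in it sits between two characters taken from the literal set :, b, l, a, n, k.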
Pattern details:
- [[:alnum:][:blank:]]+ - 1+ alphanumeric or horizontal whitespace symbols
- [[:blank:]] - 1 horizontal whitespace symbol
- [yY] - either y or Y
- [[:blank:]] - 1 horizontal whitespace symbol
- [[:alnum:][:blank:]]+ - 1+ alphanumeric or horizontal whitespace symbols
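If you also want to extract the address without stringr, the same pattern can be used with regexpr and regmatches from base R. This is only a sketch: the fulltext value below is a hypothetical stand-in for the asker's data, and trimws() is added because the first character class also accepts leading blanks.

```r
pattern <- "[[:alnum:][:blank:]]+[[:blank:]][yY][[:blank:]][[:alnum:][:blank:]]+"

# Hypothetical input; the real fulltext is not shown in the question.
fulltext <- "Address: 26 de Marzo y Pareyra de la Luz, more text"

# Detection: TRUE when the pattern occurs anywhere in the string.
found <- grepl(pattern, fulltext)
print(found)
## => [1] TRUE

# Extraction: regexpr() locates the first match, regmatches() pulls it out.
m <- regexpr(pattern, fulltext)
address <- trimws(regmatches(fulltext, m))
print(address)
## => [1] "26 de Marzo y Pareyra de la Luz"
```

The match stops at the colon on the left and at the comma on the right, because neither character is alphanumeric or blank.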