Extracting only first appearance from a list of patterns in R
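You can use str_extract from the stringr package, which returns only the first match in each string:
library(stringr)
df$Match <- str_extract(df$Text, paste0("(?i)\\b(", paste(country_list, collapse="|"), ")\\b"))
df <- within(df, Match[ofInterest == '0'] <- NA)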
# > df
# Text ofInterest Match
# 1 This is 0 <NA>
# 2 a test to 0 <NA>
# 3 find country 0 <NA>
# 4 names like Algeria 1 Algeria
# 5 Albania and Afghanistan 1 Albania
# 6 in the data 1 <NA>
# 7 and return only the 0 <NA>
# 8 first match in each 0 <NA>
# 9 string, Algeria and Albania 1 Algeria
# 10 not Afghanistan 0 <NA>
Here, paste0("(?i)\\b(", paste(country_list, collapse="|"), ")\\b") will create a pattern like (?i)\b(Albania|Poland|France)\b, where:
- (?i) - case insensitive matching
- \b - a word boundary
- ( - start of a capturing group:
  - paste(country_list, collapse="|") will result in a |-separated list of country names, like Albania|Poland|France etc.
- ) - end of the group
- \b - a word boundary.
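To see the pattern in action, here is a minimal sketch. It assumes a hypothetical three-name vector, countries, standing in for country_list, and the test strings are made up for illustration:

```r
library(stringr)

# Hypothetical short list standing in for country_list
countries <- c("Albania", "Poland", "France")

# Build the alternation pattern the same way as in the answer
pattern <- paste0("(?i)\\b(", paste(countries, collapse = "|"), ")\\b")
cat(pattern, "\n")
# (?i)\b(Albania|Poland|France)\b

# str_extract keeps only the first match per string:
# - "france" matches despite the lower case because of (?i)
# - "Polandia" does not match, since \b requires a word boundary after "Poland"
res <- str_extract(c("from france to Poland", "Polandia", "no match"), pattern)
res
# "france" NA NA
```

Note that the match is returned as it appears in the text ("france"), not as it is spelled in the list.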
The df <- within(df, Match[ofInterest == '0'] <- NA) line then sets Match back to NA in every row where the ofInterest column value is 0.
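The masking step can also be written with plain subset assignment. The sketch below uses a made-up two-row data frame, toy, that only borrows the column names from df; it should behave the same as the within() call, assuming ofInterest holds 0/1 values:

```r
# Toy data frame with the same column names as df (values are made up)
toy <- data.frame(
  Text = c("not Afghanistan", "names like Algeria"),
  ofInterest = c(0, 1),
  Match = c("Afghanistan", "Algeria"),
  stringsAsFactors = FALSE
)

# Blank out the match wherever the row is not of interest
toy$Match[toy$ofInterest == 0] <- NA
toy$Match
# NA "Algeria"
```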
Another possible solution is based on intersect with country_list: split each phrase into separate words, then take the first element of the intersection:
library(tidyverse)
library(countrycode)
df %>%
  rowwise() %>%
  mutate(Match = if_else(ofInterest == 1,
                         intersect(unlist(str_split(Text, "\\s")), country_list)[1],
                         NA_character_)) %>%
  ungroup()
#> # A tibble: 10 × 3
#> Text ofInterest Match
#> <chr> <dbl> <chr>
#> 1 This is 0 <NA>
#> 2 a test to 0 <NA>
#> 3 find country 0 <NA>
#> 4 names like Algeria 1 Algeria
#> 5 Albania and Afghanistan 1 Albania
#> 6 in the data 1 <NA>
#> 7 and return only the 0 <NA>
#> 8 first match in each 0 <NA>
#> 9 string, Algeria and Albania 1 Algeria
#> 10 not Afghanistan 0 <NA>
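The intersect approach works because intersect keeps the order of its first argument, so the first element is the first country that appears in the text. A small base-R sketch of that behaviour follows, using a hypothetical three-name countries vector and made-up strings. It also illustrates one caveat that seems worth checking on real data: a word with punctuation attached (such as "Albania,") is not an exact match unless the punctuation is stripped first.

```r
# Hypothetical subset standing in for country_list
countries <- c("Afghanistan", "Albania", "Algeria")

# Split on whitespace, as in the answer
words <- unlist(strsplit("string, Algeria and Albania", "\\s"))
first_hit <- intersect(words, countries)[1]
first_hit
# "Algeria"  (text order, not list order)

# Caveat: "Albania," with a trailing comma is not equal to "Albania"
words2 <- unlist(strsplit("Albania, then Algeria", "\\s"))
raw_hit <- intersect(words2, countries)[1]
raw_hit
# "Algeria"

# Stripping punctuation before intersecting recovers the earlier match
clean_hit <- intersect(gsub("[[:punct:]]", "", words2), countries)[1]
clean_hit
# "Albania"
```

Unlike the regex solution, this comparison is case sensitive and would not find multi-word names, since each phrase is split into single words.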