Extracting only first appearance from a list of patterns in R

You can use

library(stringr)

df$Match <- str_extract(df$Text, paste0("(?i)\\b(", paste(country_list, collapse="|"), ")\\b"))
df <- within(df, Match[ofInterest == '0'] <- NA)
# > df
#                           Text ofInterest   Match
# 1                      This is          0    <NA>
# 2                    a test to          0    <NA>
# 3                 find country          0    <NA>
# 4           names like Algeria          1 Algeria
# 5      Albania and Afghanistan          1 Albania
# 6                  in the data          1    <NA>
# 7          and return only the          0    <NA>
# 8          first match in each          0    <NA>
# 9  string, Algeria and Albania          1 Algeria
# 10             not Afghanistan          0    <NA>

Here, paste0("(?i)\\b(", paste(country_list, collapse="|"), ")\\b") builds a pattern made up of:

  • (?i) - case insensitive matching
  • \b - a word boundary
  • ( - start of a capturing group:
    • paste(country_list, collapse="|") will result in a |-separated list of country names, like Albania|Poland|France etc.
  • ) - end of the group
  • \b - word boundary.
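
To see what the assembled pattern looks like, here is a minimal sketch on a made-up three-name list (the real country_list is not shown in this answer, so the names below are an assumption for illustration only):

```r
library(stringr)

# Hypothetical three-name list; the real country_list is not shown in this answer
country_list <- c("Algeria", "Albania", "Afghanistan")

# Same construction as above
pattern <- paste0("(?i)\\b(", paste(country_list, collapse = "|"), ")\\b")
cat(pattern, "\n", sep = "")
# (?i)\b(Algeria|Albania|Afghanistan)\b

# str_extract() returns only the first match in each string, or NA if there is none
hits <- str_extract(
  c("Albania and Afghanistan", "visiting ALBANIA", "in the data"),
  pattern
)
hits
# [1] "Albania" "ALBANIA" NA
```

Note that the match is returned as it appears in the text, so with (?i) the case of the result follows the input string rather than the entry in country_list.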

The df <- within(df, Match[ofInterest == '0'] <- NA) line then sets Match to NA in every row where the ofInterest column value is 0.
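
As a rough check of that masking step, the sketch below applies it to a two-row toy data frame (the values are made up) and compares it with plain index assignment, which should give the same Match column:

```r
# Toy data frame; the values are made up for illustration
toy <- data.frame(
  Text       = c("not Afghanistan", "names like Algeria"),
  ofInterest = c(0, 1),
  Match      = c("Afghanistan", "Algeria")
)

# Masking step as written in the answer
masked_within <- within(toy, Match[ofInterest == '0'] <- NA)

# Equivalent form using plain index assignment
masked_index <- toy
masked_index$Match[masked_index$ofInterest == 0] <- NA

masked_within$Match
# [1] NA        "Algeria"
identical(masked_within$Match, masked_index$Match)
# [1] TRUE
```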


Another possible solution splits each phrase into separate words, intersects those words with country_list, and takes the first element of the intersection:

library(tidyverse)
library(countrycode)

df %>% 
  rowwise %>% 
  mutate(Match = if_else(ofInterest == 1,
   intersect(unlist(str_split(Text,"\\s")), country_list)[1], NA_character_)) %>%
  ungroup

#> # A tibble: 10 × 3
#>    Text                        ofInterest Match  
#>    <chr>                            <dbl> <chr>  
#>  1 This is                              0 <NA>   
#>  2 a test to                            0 <NA>   
#>  3 find country                         0 <NA>   
#>  4 names like Algeria                   1 Algeria
#>  5 Albania and Afghanistan              1 Albania
#>  6 in the data                          1 <NA>   
#>  7 and return only the                  0 <NA>   
#>  8 first match in each                  0 <NA>   
#>  9 string, Algeria and Albania          1 Algeria
#> 10 not Afghanistan                      0 <NA>
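
What happens inside each row can be sketched on single strings, again with a hypothetical three-name country_list. intersect() keeps the order of its first argument, so [1] picks the country that comes first in the text, and indexing an empty result gives NA:

```r
library(stringr)

# Hypothetical list, deliberately in a different order than the text
country_list <- c("Afghanistan", "Albania", "Algeria")

# Split one phrase into words on whitespace
words <- unlist(str_split("Albania and Afghanistan", "\\s"))

# Result follows the order of `words`, not of country_list
both <- intersect(words, country_list)
both
# [1] "Albania"     "Afghanistan"

first_match <- both[1]

# No country in the phrase: empty intersection, so [1] yields NA
no_match <- intersect(unlist(str_split("in the data", "\\s")), country_list)[1]

# A trailing comma stays attached to the word, so this token does not match
with_comma <- intersect(unlist(str_split("Albania, then", "\\s")), country_list)[1]
```

Because the comparison is an exact, case-sensitive match on whole whitespace-separated tokens, this approach may miss names followed by punctuation or names made of several words; the regex version above does not have the punctuation limitation.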