Extract a substring from a character string using regex in R
I'm scraping PDF reports for their data, and I'm trying to extract the location each report is based on. I've got a character string with the location, followed by a rolling 13-month header, seen here:
header_line <- "Corp Dec '20 Jan '21 Feb '21 Mar '21 Apr '21 May '21 Jun '21 Jul '21 Aug '21 Sep '21 Oct '21 Nov '21 Dec '21"
I'd like to extract all characters from the beginning of the string up to the start of whatever month appears first. Because it's a rolling 13-month report, any of the month abbreviations could appear next to the location.
I have this working for the above example, but I'm not sure how to create an "Or pattern" with regex. I know I could brute force it with a loop or apply function, but I was hoping there was a less dirty way.
stringr::str_extract(header_line, "[^Dec]+")
[1] "Corp "
Solution 1:
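It is difficult to anticipate every form the location could take, but the solution below should cover most cases. It lazily matches everything before the first occurrence of 3 alphabetical characters followed by a space, an apostrophe, and 2 digits. The match keeps the space that precedes the month, so wrap the call in str_trim() if you want it removed.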
str_extract(header_line, '^(.*?)(?=[a-zA-Z]{3}\\s\'\\d{2})')
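On the "or pattern" from the question: alternatives in a regex are separated with | inside a group, whereas "[^Dec]+" is a negated character class that matches any run of characters other than D, e, or c, so it only happened to work for "Corp". As a minimal sketch (not part of the original answer), the lookahead can be limited to real month abbreviations by building the alternation from R's built-in month.abb constant. The names month_alt and pattern are illustrative, and English month abbreviations are assumed.

```r
library(stringr)

header_line <- "Corp Dec '20 Jan '21 Feb '21 Mar '21 Apr '21 May '21 Jun '21 Jul '21 Aug '21 Sep '21 Oct '21 Nov '21 Dec '21"

# month.abb is a base R constant: "Jan", "Feb", ..., "Dec"
month_alt <- paste(month.abb, collapse = "|")

# Lazily match from the start of the string up to (but not including)
# whitespace followed by a month abbreviation, a space, an apostrophe, and two digits
pattern <- paste0("^.*?(?=\\s+(?:", month_alt, ")\\s'\\d{2})")

str_extract(header_line, pattern)
# [1] "Corp"
```

Putting the whitespace inside the lookahead keeps it out of the match, so no trimming is needed afterwards.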
Test cases:
header_line <- "Corp Dec '20 Jan '21 Feb '21 Mar '21 Apr '21 May '21 Jun '21 Jul '21 Aug '21 Sep '21 Oct '21 Nov '21 Dec '21"
header_line2 <- "Corp multiple words Dec '20 Jan '21 Feb '21 Mar '21 Apr '21 May '21 Jun '21 Jul '21 Aug '21 Sep '21 Oct '21 Nov '21 Dec '21"
header_line3 <- "Corp multiple words 1 Dec '20 Jan '21 Feb '21 Mar '21 Apr '21 May '21 Jun '21 Jul '21 Aug '21 Sep '21 Oct '21 Nov '21 Dec '21"
header_line4 <- "Corp multiple 444 Dec '20 Jan '21 Feb '21 Mar '21 Apr '21 May '21 Jun '21 Jul '21 Aug '21 Sep '21 Oct '21 Nov '21 Dec '21"
str_extract(header_line, '^(.*?)(?=[a-zA-Z]{3}\\s\'\\d{2})')
[1] "Corp "
str_extract(header_line2, '^(.*?)(?=[a-zA-Z]{3}\\s\'\\d{2})')
[1] "Corp multiple words "
str_extract(header_line3, '^(.*?)(?=[a-zA-Z]{3}\\s\'\\d{2})')
[1] "Corp multiple words 1 "
str_extract(header_line4, '^(.*?)(?=[a-zA-Z]{3}\\s\'\\d{2})')
[1] "Corp multiple 444 "
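If you would rather avoid the stringr dependency, a base R alternative is to delete everything from the whitespace before the first month onward with sub(). This is a sketch under assumptions rather than part of the original answer: the headers are shortened to three months for readability, and the names months_tail, headers, and locations are made up for the example.

```r
# Shortened stand-ins for the real 13-month headers (assumption for brevity)
months_tail <- "Dec '20 Jan '21 Feb '21"
headers <- paste(
  c("Corp", "Corp multiple words", "Corp multiple words 1", "Corp multiple 444"),
  months_tail
)

# Remove the first "<whitespace><3 letters> '<2 digits>" occurrence and everything after it.
# sub() is vectorized, so all headers are processed in one call.
locations <- sub("\\s+[A-Za-z]{3}\\s'\\d{2}.*$", "", headers, perl = TRUE)

locations
# [1] "Corp"                  "Corp multiple words"   "Corp multiple words 1"
# [4] "Corp multiple 444"
```

Because the leading whitespace is part of what gets removed, the results carry no trailing space.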