Find length and placement of alphabet characters in character string R
I have a dataframe composed of 3 columns:
df <- data.frame(column1 = c("ab 34, 35, 36", "cb 23", "df 45, gh 46", "gh 21"),
column2 = c("ID_27", "ID_28", "ID_29", "ID_30"),
column3 = c("area51", "area52", "area53", "area54"))
> df
column1 column2 column3
1 ab 34, 35, 36 ID_27 area51
2 cb 23 ID_28 area52
3 df 45, gh 46 ID_29 area53
4 gh 21 ID_30 area54
I need to identify the length of the letter prefixes in column1. This means being able to identify and filter rows similar to row 3 here which contains more than two letters, but in a large data set. Eg (output after filter).
column1 column2 column3
3 df 45, gh 46 ID_29 area53
Ideally I would also like to get the position of the first letter in each prefix. This could be achieved by putting the length and position values into two additional but separate columns in the dataframe, or by filtering it. Either way which achieves the same result is fine. Both tidyverse and older approaches welcome.
If anyone wants to go further, the idea is to attach the two letter prefix to the number codes which follow it, up until the letter prefix changes like the example above. The following step is to then paste or copy these new values in separate rows while simultaneously duplicating the values in the other columns.
df <- data.frame(column1 = c("ab 34, 35, 36", "cb 23", "df 45, gh 46", "gh 21"),
column2 = c("ID_27", "ID_28", "ID_29", "ID_30"),
column3 = c("area51", "area52", "area53", "area54"))
library(dplyr)
library(stringr)
df %>% filter(str_count(column1, "[a-z]+") > 1)
# A tibble: 1 x 3
# Groups: column3 [1]
column1 column2 column3
<fct> <fct> <fct>
1 df 45, gh 46 ID_29 area53
Here's a solution that gets you pretty far toward your ultimate goal: "if anyone wants to go further, the idea is to attach the two letter prefix to the number codes which follow it, up until the letter prefix changes like the example above. The following step is to then paste or copy these new values in separate rows while simultaneously duplicating the values in the other columns."
library(dplyr)
library(stringr)
library(tidyr)
df %>%
mutate(
# isolate the prefix in a list:
prefix = str_extract_all(column1,"[a-z]+")) %>%
# cast the list values in `prefix` into long format:
unnest_longer(prefix) %>%
# extract the digits in a list:
mutate(
digits = str_extract_all(column1,"\\d+")) %>%
# cast the list values in `digits` into long format:
unnest_longer(digits) %>%
# paste `prefix`, `digits` together:
mutate(column1 = paste0(prefix, digits)) %>%
# remove obsolete columns:
select(-c(prefix, digits))
# A tibble: 9 × 3
column1 column2 column3
<chr> <chr> <chr>
1 ab34 ID_27 area51
2 ab35 ID_27 area51
3 ab36 ID_27 area51
4 cb23 ID_28 area52
5 df45 ID_29 area53
6 df46 ID_29 area53
7 gh45 ID_29 area53
8 gh46 ID_29 area53
9 gh21 ID_30 area54