R splitting string values containing columns and simultaneously duplicating other columns while allowing for letter prefix changes
This question relates to an earlier and simpler question here. I have a complex dataframe, the result of a join between two separate dataframes, comprising three columns containing alphanumerical values, ID numbers and area codes.
df <- data.frame(column1 = c("ab 34, 35, 36", "cb 23", "df 45, 46", "gh 21"),
column2 = c("ID_27", "ID_28", "ID_29", "ID_30"),
column3 = c("area51", "area52", "area53", "area54"))
df_join <- data.frame(column1 = c("ab 77, cb 35", "df 23, gh 73", "ij 45, kl 46, mn 21"),
column2 = c("ID_27", "ID_28", "ID_29"),
column3 = c("area51", "area52", "area53"))
df_joined <- rbind(df, df_join)
df_joined
column1 column2 column3
1 ab 34, 35, 36 ID_27 area51
2 cb 23 ID_28 area52
3 df 45, 46 ID_29 area53
4 gh 21 ID_30 area54
5 ab 77, cb 35 ID_27 area51
6 df 23, gh 73 ID_28 area52
7 ij 45, kl 46, mn 21 ID_29 area53
What I would like to do is clean up the first column by removing the commas and the spaces, so that I am left with neatly uniform 4 character values while simultaneously duplicating the values in the other columns. You will also notice that the two letter prefix changes for rows 5-7, the R code has to allow for this. This is the example of what I'm after:
new_df
# A tibble: 14 x 3
column1 column2 column3
<chr> <chr> <chr>
1 ab34 ID_27 area51
2 ab35 ID_27 area51
3 ab36 ID_27 area51
4 ab77 ID_27 area51
5 cb23 ID_28 area52
6 cb35 ID_27 area51
7 df23 ID_28 area52
8 df45 ID_29 area53
9 df46 ID_29 area53
10 gh21 ID_30 area54
11 gh73 ID_28 area52
12 ij45 ID_29 area53
13 kl46 ID_29 area53
14 mn21 ID_29 area53
The end result doesn't have to be a tibble, and I am working on transforming large dataframes so the code has to be applicable in a general sense. Any ideas? Tidyverse or older methods welcome?
Here are two approaches. The key function is mystrsplit
. It first captures characters and digits separately in two columns prefix
and digit
, then fills down all the NA
prefixes using last-observation-carried-forward, and last paste prefix
and digit
together. After mystrsplit
, we use standard data.table operations to bind all rows together.
library(data.table)
mystrsplit <- function(x) {
x[x == ""] <- NA_character_
x <- stringr::str_match_all(x, "(?<prefix>[a-z]+)? (?<digit>\\d+)")
lapply(x, function(s) stringr::str_c(tidyr:::fillDown(s[, "prefix"]), s[, "digit"]))
}
setDT(df_joined)[, c(
list(column1 = unlist(x <- mystrsplit(column1))),
lapply(.SD, rep, times = lengths(x))
), .SDcols = -"column1"]
Output
column1 column2 column3
1: ab34 ID_27 area51
2: ab35 ID_27 area51
3: ab36 ID_27 area51
4: cb23 ID_28 area52
5: df45 ID_29 area53
6: df46 ID_29 area53
7: gh21 ID_30 area54
8: ab77 ID_27 area51
9: cb35 ID_27 area51
10: df23 ID_28 area52
11: gh73 ID_28 area52
12: ij45 ID_29 area53
13: kl46 ID_29 area53
14: mn21 ID_29 area53
Or in a tidyverse
way
library(dplyr)
library(tidyr)
mystrsplit <- function(x) {
x[x == ""] <- NA_character_
x <- stringr::str_match_all(x, "(?<prefix>[a-z]+)? (?<digit>\\d+)")
lapply(x, function(s) stringr::str_c(tidyr:::fillDown(s[, "prefix"]), s[, "digit"]))
}
df_joined %>% mutate(column1 = mystrsplit(column1)) %>% unnest(column1)
Output
# A tibble: 14 x 3
column1 column2 column3
<chr> <chr> <chr>
1 ab34 ID_27 area51
2 ab35 ID_27 area51
3 ab36 ID_27 area51
4 cb23 ID_28 area52
5 df45 ID_29 area53
6 df46 ID_29 area53
7 gh21 ID_30 area54
8 ab77 ID_27 area51
9 cb35 ID_27 area51
10 df23 ID_28 area52
11 gh73 ID_28 area52
12 ij45 ID_29 area53
13 kl46 ID_29 area53
14 mn21 ID_29 area53