Transpose multiple sub-headers into factor column in R
Here is one possibility (assuming that all sub-headers begin with Fuente
) using tidyverse
. Here, I create a grouping column (idx
) by gathering all rows until Fuente
appears in a subsequent row. Then, I split those into individual dataframes and put into a list. Then, I use map
to apply a function to that list. I extract the text after Fuente
, then copy that to all rows in that dataframe. Finally, I bind the list of dataframes back together.
Tidyverse
library(tidyverse)
df %>%
group_by(idx = cumsum(str_detect(fecha, "Fuente"))) %>%
group_split(., .keep = FALSE) %>%
map(., function(x)
x %>%
mutate(Fuente = sub('.*:\\s*', "", fecha)[1]) %>%
slice(-1)) %>%
bind_rows()
Or if you do have other sub-headers besides Fuente
, then you could use "[a-z]" in group_by(idx = cumsum(str_detect(fecha, "[a-z]"))
, rather than "Fuente".
Data.table
Another option using data.table
:
setDT(dt)[, Fuente := ifelse(grepl(':', df$fecha, fixed = TRUE),
sub('.*:\\s*', "", df$fecha), NA)]
dt[, Fuente := Fuente[nafill(replace(.I, is.na(Fuente), NA), "locf")]]
dt <- dt[!grepl("Fuente", dt$fecha),]
Output
fecha producto origen pmin pmax pfrec obs category Fuente
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 20/02/2020 Aleta de raya Tabasco 23.00 27.00 25.00 "" pescado La Nueva Viga, DF
2 20/02/2020 Bandera Campeche 35.00 39.00 37.00 "" pescado La Nueva Viga, DF
3 20/02/2020 Besugo Veracruz 15.00 19.00 17.00 "" pescado La Nueva Viga, DF
4 20/02/2020 Cazón con cabeza Veracruz 60.00 65.00 63.00 "" pescado La Nueva Viga, DF
5 20/02/2020 Huachinango Golfo Tamaulipas 165.00 200.00 190.00 "OBS" pescado Monterrey, Nuevo León
6 20/02/2020 Pampano Tamaulipas -- -- 195.00 "OBS" pescado Monterrey, Nuevo León
7 20/02/2020 Sargo Tamaulipas -- -- 84.00 "OBS" pescado Monterrey, Nuevo León
8 20/02/2020 Trucha marina Tamaulipas -- -- 98.00 "OBS" pescado Monterrey, Nuevo León
9 17/02/2020 Huachinango Golfo Tamaulipas 210.00 220.00 215.00 "OBS" pescado Monterrey, Nuevo León
10 17/02/2020 Pampano Tamaulipas -- -- 195.00 "OBS" pescado Monterrey, Nuevo León
Benchmark
data.table
is faster than any of the tidyverse
options
Here is an alternative tidyverse
approach, main points are adding a new column with add_column
from tibble
package and data wrangling:
- filter all rows that contain
Fuente
- bind to original df to get equal column lengths!
- add a new column by getting long form of df filtering and wrangling
- remove with filter
!
rows containingFuente:
library(tidyverse)
df %>%
filter(if_any(everything(), ~str_detect(., "Fuente"))) %>%
bind_rows(df) %>%
add_column(df %>%
pivot_longer(everything(), values_to = "Fuente") %>%
filter(str_detect(Fuente, "Fuente")) %>%
mutate(Fuente = sub('.*:', '', Fuente)) %>%
select(-name)
)%>%
filter(!if_any(everything(), ~str_detect(fecha, "Fuente:")))
fecha producto origen pmin pmax pfrec obs category Fuente
1 20/02/2020 Aleta de raya Tabasco 23.00 27.00 25.00 pescado La Nueva Viga, DF
2 20/02/2020 Bandera Campeche 35.00 39.00 37.00 pescado La Nueva Viga, DF
3 20/02/2020 Besugo Veracruz 15.00 19.00 17.00 pescado La Nueva Viga, DF
4 20/02/2020 Cazón con cabeza Veracruz 60.00 65.00 63.00 pescado La Nueva Viga, DF
5 20/02/2020 Huachinango Golfo Tamaulipas 165.00 200.00 190.00 OBS pescado Monterrey, Nuevo León
6 20/02/2020 Pampano Tamaulipas -- -- 195.00 OBS pescado Monterrey, Nuevo León
7 20/02/2020 Sargo Tamaulipas -- -- 84.00 OBS pescado Monterrey, Nuevo León
8 20/02/2020 Trucha marina Tamaulipas -- -- 98.00 OBS pescado Monterrey, Nuevo León
9 17/02/2020 Huachinango Golfo Tamaulipas 210.00 220.00 215.00 OBS pescado Monterrey, Nuevo León
10 17/02/2020 Pampano Tamaulipas -- -- 195.00 OBS pescado Monterrey, Nuevo León