Create table from wrapped text in R
Edited:
From text based on variable named a I would like to obtain a table in which description cell will be unwrapped.
a <-
"
category variable description value
A A This is variable named as A 123
which is responsible for sth
B This is variable named as B 222.1
which is responsible for sth
else
B A This is sth 2
out of 4
Other c Other va This is variable named as 222
ategory riable other variable which can be
nullable
Other va This is variable named as 0
riable A other variable A which can
be nullable
"
Result which I want to have:
One option to achieve your desired result would be read your variable as a fixed width file using e.g. readr::read_fwf
and some additional data wrangling steps where I make use of tidyr
and dplyr
:
library(dplyr)
library(tidyr)
library(readr)
df <- readr::read_fwf(file = a, skip = 1)
names(df) <- unlist(df[1, ])
df <- df[-1,]
df %>%
filter(!is.na(description)) %>%
tidyr::fill(category, variable) %>%
group_by(category, variable) %>%
summarise(description = paste(description, collapse = " "), value = value[!is.na(value)])
#> `summarise()` has grouped output by 'category'. You can override using the `.groups` argument.
#> # A tibble: 2 × 4
#> # Groups: category [1]
#> category variable description value
#> <chr> <chr> <chr> <chr>
#> 1 A A This is variable named as A which is responsible for … 123
#> 2 A B This is variable named as B which is responsible for … 222.1
This is similar to @stefans. The main difference is this way requires you to specify column_widths
with readr::fwf_cols()
. (Which may be an advantage or disadvantage, depending on the consistency/stability of your incoming data files.)
a <-
"category variable description value
A A This is variable named as A 123
which is responsible for sth
B This is variable named as B 222.1
which is responsible for sth
else
"
column_widths <-
readr::fwf_cols(
category = 13,
variable = 8,
description = 32,
value = 10
)
I(a) |>
readr::read_fwf(
col_positions = column_widths,
skip = 1 # Because the headers are defined in `column_widths`
) |>
tidyr::fill(category, variable) |>
dplyr::mutate(
value = as.character(value),
value = dplyr::coalesce(value, "")
) |>
dplyr::group_by(category, variable) |>
dplyr::summarize(
description = paste0(description, collapse = " "),
value = as.numeric(paste0(value, collapse = " ")),
) |>
dplyr::ungroup()
Output:
# A tibble: 2 x 4
category variable description value
<chr> <chr> <chr> <dbl>
1 A A This is variable named as ~ 123
2 A B This is variable named as ~ 222.
I'm stumped, now that the example has changed so all cells can wrap to new lines. Hopefully @stefan has an idea.
A few rough suggestions.
-
Can you control how the text is created? Your OP was so thorough that I assume you explored & eliminated that possibility. But if you have control (such as widening the output parameters), that's the easiest and most direct.
-
I think you need at least one column to indicate when a new line is a new record, and when a new line is a continuation of the same record. In the first iteration of your OP, the
variable
column signaled this. -
This new indicator variable (called
record_id
) would have the values {1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5}. See how lines 1 & 2 belong to record 1, and lines 3-5 belong to record 2. -
You might programmatically modify the
a
string somehow. Maybe start withreadr::read_lines()
and use a regex to indicate what lines represent new records. Then use a typical readr function to convert array of strings to a data.frame.. We'd need to know more about how regular/repeatable the values are for the first two columns. I see you were considering this initially, given your [gsub] tag. -
Or read everything with as a fwf first, and use rematch2 to create the variable to indicate whether the record is new or a continuation. It all depends if there's a good pattern to the wrapping cells of
variable
. -
Does
a
represent a file on disk? How consistent are the line wraps? Ifrecord_id
is the same for all files, then you could manually determinerecord_id
once, and add it to each dataset. Then proceed similarly to the answers by @stefan and me.