Create table from wrapped text in R

Edited:

From text based on variable named a I would like to obtain a table in which description cell will be unwrapped.

a <- 
  "
   category     variable    description                    value
   A            A           This is variable named as A    123
                            which is responsible for sth
                B           This is variable named as B    222.1 
                            which is responsible for sth
                            else
   B            A           This is sth                    2
                                                           out of 4  
   Other c      Other va    This is variable named as      222  
   ategory      riable      other variable which can be   
                            nullable
                Other va    This is variable named as      0  
                riable A    other variable A which can   
                            be nullable
  "

Result which I want to have:

enter image description here


One option to achieve your desired result would be read your variable as a fixed width file using e.g. readr::read_fwf and some additional data wrangling steps where I make use of tidyr and dplyr:

library(dplyr)
library(tidyr)
library(readr)

df <- readr::read_fwf(file = a, skip = 1)
names(df) <- unlist(df[1, ])
df <- df[-1,]
df %>% 
  filter(!is.na(description)) %>% 
  tidyr::fill(category, variable) %>% 
  group_by(category, variable) %>% 
  summarise(description = paste(description, collapse = " "), value = value[!is.na(value)])
#> `summarise()` has grouped output by 'category'. You can override using the `.groups` argument.
#> # A tibble: 2 × 4
#> # Groups:   category [1]
#>   category variable description                                            value
#>   <chr>    <chr>    <chr>                                                  <chr>
#> 1 A        A        This is variable named as A which is responsible for … 123  
#> 2 A        B        This is variable named as B which is responsible for … 222.1

This is similar to @stefans. The main difference is this way requires you to specify column_widths with readr::fwf_cols(). (Which may be an advantage or disadvantage, depending on the consistency/stability of your incoming data files.)

a <- 
"category     variable    description                    value
A            A           This is variable named as A    123
                         which is responsible for sth
             B           This is variable named as B    222.1 
                         which is responsible for sth
                         else
"
column_widths <-
  readr::fwf_cols(
    category        = 13,
    variable        = 8,
    description     = 32,
    value           = 10
  )

I(a) |> 
  readr::read_fwf(
    col_positions = column_widths,
    skip          = 1         # Because the headers are defined in `column_widths`
  ) |> 
  tidyr::fill(category, variable) |> 
  dplyr::mutate(
    value   = as.character(value),
    value   = dplyr::coalesce(value, "")
  ) |> 
  dplyr::group_by(category, variable) |> 
  dplyr::summarize(
    description = paste0(description, collapse = " "), 
    value       = as.numeric(paste0(value, collapse = " ")), 
  ) |> 
  dplyr::ungroup()

Output:

# A tibble: 2 x 4
  category variable description                 value
  <chr>    <chr>    <chr>                       <dbl>
1 A        A        This is variable named as ~  123 
2 A        B        This is variable named as ~  222.

I'm stumped, now that the example has changed so all cells can wrap to new lines. Hopefully @stefan has an idea.

A few rough suggestions.

  1. Can you control how the text is created? Your OP was so thorough that I assume you explored & eliminated that possibility. But if you have control (such as widening the output parameters), that's the easiest and most direct.

  2. I think you need at least one column to indicate when a new line is a new record, and when a new line is a continuation of the same record. In the first iteration of your OP, the variable column signaled this.

  3. This new indicator variable (called record_id) would have the values {1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5}. See how lines 1 & 2 belong to record 1, and lines 3-5 belong to record 2.

  4. You might programmatically modify the a string somehow. Maybe start with readr::read_lines() and use a regex to indicate what lines represent new records. Then use a typical readr function to convert array of strings to a data.frame.. We'd need to know more about how regular/repeatable the values are for the first two columns. I see you were considering this initially, given your [gsub] tag.

  5. Or read everything with as a fwf first, and use rematch2 to create the variable to indicate whether the record is new or a continuation. It all depends if there's a good pattern to the wrapping cells of variable.

  6. Does a represent a file on disk? How consistent are the line wraps? If record_id is the same for all files, then you could manually determine record_id once, and add it to each dataset. Then proceed similarly to the answers by @stefan and me.