Using str_split and removing duplicate values from the same record in R

r stringr

You can do this with stringr::str_split to make a list-column of genres. genre will become a list of character vectors, which you can then unnest, then take distinct observations.

library(tidyverse)

books %>%
  mutate(genre = str_split(genre, ", ")) %>%
  unnest(genre) %>%
  distinct()
#> # A tibble: 12 x 2
#>    title                 genre             
#>    <chr>                 <chr>             
#>  1 Harry Potter 1        Fantasy           
#>  2 Harry Potter 1        Young Adult       
#>  3 Harry Potter 1        Magic             
#>  4 To Kill A Mockingbird Classics          
#>  5 To Kill A Mockingbird Fiction           
#>  6 To Kill A Mockingbird Historical        
#>  7 To Kill A Mockingbird Historical Fiction
#>  8 To Kill A Mockingbird Academic          
#>  9 The Hunger Games 1    Young Adult       
#> 10 The Hunger Games 1    Fiction           
#> 11 The Hunger Games 1    Science Fiction   
#> 12 The Hunger Games 1    Dystopia

A shortcut here that I always forget about is separate_rows, which does the splitting and unnesting in one step:

books %>%
  separate_rows(genre, sep = ", ") %>%
  distinct()

is equivalent to the previous block.

To get this to a wide format, you can use tidyr::spread. To make column names "genre1", "genre2", etc. dynamically, I grouped by title and then numbered the unique genres for each title. That way, you don't need to know how many genre columns you need, like you would if you used tidyr::separate to split the column instead.

books %>%
  mutate(genre = str_split(genre, ", ")) %>%
  unnest(genre) %>%
  distinct() %>%
  group_by(title) %>%
  mutate(num = row_number() %>% paste0("genre", .)) %>%
  spread(key = num, value = genre)
#> # A tibble: 3 x 6
#> # Groups:   title [3]
#>   title                 genre1      genre2      genre3    genre4    genre5
#>   <chr>                 <chr>       <chr>       <chr>     <chr>     <chr> 
#> 1 Harry Potter 1        Fantasy     Young Adult Magic     <NA>      <NA>  
#> 2 The Hunger Games 1    Young Adult Fiction     Science … Dystopia  <NA>  
#> 3 To Kill A Mockingbird Classics    Fiction     Historic… Historic… Acade…

Here a solution using data.table and base R.

library(data.table)
setDT(books)

books = unique(books[, strsplit(genre, ", "), by = title])
books[, genre:= paste0("genre_", seq_along(V1)), by = title]
dcast(books, title ~ genre, value.var = "V1")
#                    title     genre_1     genre_2         genre_3            genre_4  genre_5
# 1:        Harry Potter 1     Fantasy Young Adult           Magic               <NA>     <NA>
# 2:    The Hunger Games 1 Young Adult     Fiction Science Fiction           Dystopia     <NA>
# 3: To Kill A Mockingbird    Classics     Fiction      Historical Historical Fiction Academic

You can use separate with a step to remove non-unique genres before using separate.

library(tidyverse)
library(magrittr)

#remove non-unique genres
books %<>% mutate(genre = map(str_split(genre, ', '), ~ paste(unique(.x), collapse = ','))) 

#separate into columns
books %>% 
  separate(col = 2, into = paste0('genre', seq(max(str_count(books$genre, ',')) + 1L))
           , sep = ',')


# # A tibble: 3 x 6
#   title                 genre1      genre2      genre3          genre4             genre5  
#   <chr>                 <chr>       <chr>       <chr>           <chr>              <chr>   
# 1 Harry Potter 1        Fantasy     Young Adult Magic           NA                 NA      
# 2 To Kill A Mockingbird Classics    Fiction     Historical      Historical Fiction Academic
# 3 The Hunger Games 1    Young Adult Fiction     Science Fiction Dystopia           NA

Converting Data Frames to XTS: order.by requires an appropriate time-based object

when I hit yarn run dev . it shows 'next' is not recognized as an internal or external command, operable program or batch file

Getting class does not exist error when running database seeder

How to update the array objects if id exist otherwise insert in mongodb

Running powershell script inside a bash script

Access form will not populate records from or write to ODBC table

Bad state: Cannot set the body fields of a Request with content-type "application/json"

Why doesn't my Hello World work in react?

What do the headers in the PowerShell ps output mean?

Text search direction in Vim

Git chmod problem: Checkout screws exec bit

How to fix Time Machine's "This backup is too large for the backup disk"?