Using str_split and removing duplicate values from the same record in R
You can do this with stringr::str_split
to make a list-column of genres. genre
will become a list of character vectors, which you can then unnest, then take distinct observations.
library(tidyverse)
books %>%
mutate(genre = str_split(genre, ", ")) %>%
unnest(genre) %>%
distinct()
#> # A tibble: 12 x 2
#> title genre
#> <chr> <chr>
#> 1 Harry Potter 1 Fantasy
#> 2 Harry Potter 1 Young Adult
#> 3 Harry Potter 1 Magic
#> 4 To Kill A Mockingbird Classics
#> 5 To Kill A Mockingbird Fiction
#> 6 To Kill A Mockingbird Historical
#> 7 To Kill A Mockingbird Historical Fiction
#> 8 To Kill A Mockingbird Academic
#> 9 The Hunger Games 1 Young Adult
#> 10 The Hunger Games 1 Fiction
#> 11 The Hunger Games 1 Science Fiction
#> 12 The Hunger Games 1 Dystopia
A shortcut here that I always forget about is separate_rows
, which does the splitting and unnesting in one step:
books %>%
separate_rows(genre, sep = ", ") %>%
distinct()
is equivalent to the previous block.
To get this to a wide format, you can use tidyr::spread
. To make column names "genre1"
, "genre2"
, etc. dynamically, I grouped by title and then numbered the unique genres for each title. That way, you don't need to know how many genre columns you need, like you would if you used tidyr::separate
to split the column instead.
books %>%
mutate(genre = str_split(genre, ", ")) %>%
unnest(genre) %>%
distinct() %>%
group_by(title) %>%
mutate(num = row_number() %>% paste0("genre", .)) %>%
spread(key = num, value = genre)
#> # A tibble: 3 x 6
#> # Groups: title [3]
#> title genre1 genre2 genre3 genre4 genre5
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Harry Potter 1 Fantasy Young Adult Magic <NA> <NA>
#> 2 The Hunger Games 1 Young Adult Fiction Science … Dystopia <NA>
#> 3 To Kill A Mockingbird Classics Fiction Historic… Historic… Acade…
Here a solution using data.table
and base R.
library(data.table)
setDT(books)
books = unique(books[, strsplit(genre, ", "), by = title])
books[, genre:= paste0("genre_", seq_along(V1)), by = title]
dcast(books, title ~ genre, value.var = "V1")
# title genre_1 genre_2 genre_3 genre_4 genre_5
# 1: Harry Potter 1 Fantasy Young Adult Magic <NA> <NA>
# 2: The Hunger Games 1 Young Adult Fiction Science Fiction Dystopia <NA>
# 3: To Kill A Mockingbird Classics Fiction Historical Historical Fiction Academic
You can use separate
with a step to remove non-unique genres before using separate
.
library(tidyverse)
library(magrittr)
#remove non-unique genres
books %<>% mutate(genre = map(str_split(genre, ', '), ~ paste(unique(.x), collapse = ',')))
#separate into columns
books %>%
separate(col = 2, into = paste0('genre', seq(max(str_count(books$genre, ',')) + 1L))
, sep = ',')
# # A tibble: 3 x 6
# title genre1 genre2 genre3 genre4 genre5
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 Harry Potter 1 Fantasy Young Adult Magic NA NA
# 2 To Kill A Mockingbird Classics Fiction Historical Historical Fiction Academic
# 3 The Hunger Games 1 Young Adult Fiction Science Fiction Dystopia NA