Filter rows where mirror-image delimiters are not paired
We may use the pattern \\[[^\\]]+(\\[|$)|(^|\\])[^\\[]+\\] in str_detect:
library(dplyr)
library(stringr)
df %>%
  filter(str_detect(Utterance, "\\[[^\\]]+(\\[|$)|(^|\\])[^\\[]+\\]"))
id Utterance
1 1 [but if I came !ho!me
2 3 =[yeah] I mean [does it
3 4 bu[t if (.) you know
4 5 =ye::a:h]
5 6 [that's right] YEAH (laughs)] [ye::a:h]
6 8 [cos] I've [heard very sketchy [stories]
7 9 oh well] that's great
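Here we check for an opening bracket [ followed by one or more characters that are not ], followed by either another [ or the end of the string ($). The second alternative applies the mirror-image check to the closing bracket: the start of the string (^) or a ], followed by one or more characters that are not [, followed by a ].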
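To make the two alternatives easier to follow, here is a minimal sketch that tests each half of the pattern separately. The vector x and the names unclosed and unopened are made up for illustration; the sample strings are adapted from the utterances shown above.

```r
library(stringr)

# Illustrative sample strings (not the original data frame)
x <- c("[yeah] fine",                  # balanced
       "[but if I came",               # [ is never closed
       "=ye::a:h]",                    # ] is never opened
       "[cos] I've [heard [stories]")  # [ followed by another [

unclosed <- "\\[[^\\]]+(\\[|$)"   # [ then non-] characters, then [ or end of string
unopened <- "(^|\\])[^\\[]+\\]"   # start or ] then non-[ characters, then ]

str_detect(x, unclosed)
#> [1] FALSE  TRUE FALSE  TRUE
str_detect(x, unopened)
#> [1] FALSE FALSE  TRUE FALSE

# Joining both halves with | gives the pattern used in the answer
str_detect(x, str_c(unclosed, "|", unopened))
#> [1] FALSE  TRUE  TRUE  TRUE
```

Splitting the pattern this way also shows which kind of bracket is unmatched in a given row, which may help when cleaning the data.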
Another possible solution, using purrr::map_dfr.
EXPLANATION
I provide, in what follows, an explanation for my solution, as asked for by @ChrisRuehlemann:
- With str_extract_all(df$Utterance, "\\[|\\]"), we extract every [ and ] of each utterance, in the order they appear in the utterance. The result is a list with one character vector per utterance.
- We iterate over the elements of that list. Each element is a vector of individual square brackets, so we first collapse it into a single string of square brackets (str_c(.x, collapse = "")).
- We compare the string of square brackets of each utterance with a string of the form [][][]... (str_c(rep("[]", length(.x)/2), collapse = "")). If these two strings are not equal, then square brackets are missing!
- When map_dfr finishes, we end up with a column of TRUE and FALSE, which we can use to filter the original dataframe as wanted.
library(tidyverse)
str_extract_all(df$Utterance, "\\[|\\]") %>%
  map_dfr(~ list(OK = str_c(.x, collapse = "") !=
                   str_c(rep("[]", length(.x)/2), collapse = ""))) %>%
  filter(df, .)
#> id Utterance
#> 1 1 [but if I came !ho!me
#> 2 3 =[yeah] I mean [does it
#> 3 4 bu[t if (.) you know
#> 4 5 =ye::a:h]
#> 5 6 [that's right] YEAH (laughs)] [ye::a:h]
#> 6 8 [cos] I've [heard very sketchy [stories]
#> 7 9 oh well] that's great
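To see the intermediate values from the explanation, the steps can be traced by hand on a single utterance. This is only a sketch: the variable names u, brackets, found and expected are introduced here for illustration, and the utterance is row 6 from the output above.

```r
library(stringr)

# One utterance with an unmatched closing bracket (row 6 above)
u <- "[that's right] YEAH (laughs)] [ye::a:h]"

# Step 1: extract the brackets in order of appearance
brackets <- str_extract_all(u, "\\[|\\]")[[1]]
brackets
#> [1] "[" "]" "]" "[" "]"

# Step 2: collapse them into a single string
found <- str_c(brackets, collapse = "")
found
#> [1] "[]][]"

# Step 3: build the reference string; rep() truncates 5/2 = 2.5 to 2 repetitions
expected <- str_c(rep("[]", length(brackets) / 2), collapse = "")
expected
#> [1] "[][]"

# Step 4: unequal strings mean the brackets are not properly paired
found != expected
#> [1] TRUE
```

Note that this comparison only accepts the flat sequence [][]..., so nested brackets such as [[a]] would also be reported as unpaired.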