How can I remove repeated characters in a string with R?

r string

I would like to implement a function with R that removes repeated characters in a string. For instance, say my function is named removeRS, so it is supposed to work this way:

  removeRS('Buenaaaaaaaaa Suerrrrte')
  Buena Suerte
  removeRS('Hoy estoy tristeeeeeee')
  Hoy estoy triste

My function is going to be used with strings written in spanish, so it is not that common (or at least correct) to find words that have more than three successive vowels. No bother about the possible sentiment behind them. Nonetheless, there are words that can have two successive consonants (especially ll and rr), but we could skip this from our function.

So, to sum up, this function should replace the letters that appear at least three times in a row with just that letter. In one of the examples above, aaaaaaaaa is replaced with a.

Could you give me any hints to carry out this task with R?

I did not think very carefully on this, but this is my quick solution using references in regular expressions:

gsub('([[:alpha:]])\\1+', '\\1', 'Buenaaaaaaaaa Suerrrrte')
# [1] "Buena Suerte"

() captures a letter first, \\1 refers to that letter, + means to match it once or more; put all these pieces together, we can match a letter two or more times.

To include other characters besides alphanumerics, replace [[:alpha:]] with a regex matching whatever you wish to include.

I think you should pay attention to the ambiguities in your problem description. This is a first stab, but it clearly does not work with "Good Luck" in the manner you desire:

removeRS <- function(str) paste(rle(strsplit(str, "")[[1]])$values, collapse="")
removeRS('Buenaaaaaaaaa Suerrrrte')
#[1] "Buena Suerte"

How can I remove repeated characters in a string with R?

Related

Recent Posts