Find repeated words in a text

One of the most common typos is to repeat the same word twice, as as here. I need an automatic procedure to remove all the repeated words in a text file. This should not be an unusual feature for a modern editor or spell-checker; for example, I remember that MS Word introduced it several years ago! Apparently, the default spell-checker on my OS (Hunspell) can't do this, as it only finds words that are not in the dictionary.

It would be OK to have a solution for a specific Linux text editor (pluma/gedit2 or Sublime Text) and a solution based on a bash script.

With GNU grep:

echo 'Hi! Hi, same word twice twice, as as here here! ! ,123 123 need' |  grep -Eo '(\b.+) \1\b'

Output:

twice twice
as as
here here
123 123

Options:

-E: Interpret (\b.+) \1\b as an extended regular expression.

-o: Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.

Regex:

\b: Is a zero-width word boundary.

.+: Matches one or more characters of any kind (including spaces and punctuation).

\1: The parentheses () mark a capturing group, and \1 is a back-reference that matches the same text the first capturing group matched.

Reference: The Stack Overflow Regular Expressions FAQ
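
Since the question asks to remove the repeats rather than just list them, the same idea can be moved into a substitution. This is only a minimal sketch, assuming GNU sed (\b and \w are GNU extensions, as are back-references under -E). The function name dedupe_words is made up for illustration, and \w+ is used instead of .+ so that only single words are collapsed:

```shell
# Hypothetical helper: collapse a run of the same word to a single copy.
# Assumes GNU sed; \b and \w are GNU extensions and may not work elsewhere.
dedupe_words() {
    # (\w+)      capture one word
    # ( \1\b)+   one or more repeats of it, each preceded by a single space
    sed -E 's/\b(\w+)( \1\b)+/\1/g' "$@"
}

echo 'same word twice twice, as as here here! ,123 123 need' | dedupe_words
```

File names can be passed as arguments instead of piping input. The sketch only catches repeats separated by exactly one space on the same line, and it is case-sensitive, so "The the" is left alone.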

It sounds like something like this is what you want (using any awk in any shell on every UNIX box):

$ cat tst.awk
BEGIN { RS=""; ORS="\n\n" }
{
    head = prev = ""
    tail = $0
    while ( match(tail,/[[:alpha:]]+/) ) {
        word = substr(tail,RSTART,RLENGTH)
        head = head substr(tail,1,RSTART-1) (word == prev ? "" : word)
        tail = substr(tail,RSTART+RLENGTH)
        prev = word
    }
    print head tail
}

$ cat file
the quick quick brown
fox jumped jumped
jumped over the lazy
lazy dogs back

$ awk -f tst.awk file
the quick  brown
fox jumped
 over the lazy
 dogs back

but please ask a new question with more truly representative sample input and expected output that show punctuation, differences in capitalization, multiple paragraphs, duplicated words at the start/end of sentences, and various other non-trivial cases.

Perlishly, I'd be thinking:

use strict;
use warnings;

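# Undefine the input record separator so <DATA> returns everything at once.
local $/;

my $slurp = <DATA>;
# The trailing \b stops "as asset" from being treated as a repeat.
# The obsolete /o flag has been dropped.
$slurp =~ s/\b(\w+)\W\1\b/$1/g;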
print $slurp;

__DATA__
Hi! Hi, same same? word twice twice, as as here here! ! ,123 123 need
need as here

Bear in mind though - a lot of pattern matching is line oriented, so you've got to be careful if you cross line boundaries. If you can exclude that case, then you've got an easier job because you can parse one line at a time. I'm not doing that, so you'll end up reading the whole file into memory.
