Regular expression to find and remove duplicate words

As said by others, you need more than a regex to keep track of words:

var words = new HashSet<string>();
string text = "I like the environment. The environment is good.";
text = Regex.Replace(text, "\\w+", m =>
                     words.Add(m.Value.ToUpperInvariant())
                         ? m.Value
                         : String.Empty);

This seems to work for me

(\b\S+\b)(?=.*\1)

Matches like so

apple apple orange  
orange red blue green orange green blue  
pirates ninjas cowboys ninjas pirates

Well, Jeff has shown me how to use the magic of in-expression backreferences and the global modifier to make this one happen, so my original answer is inoperative. You should all go vote for Jeff's answer. However, for posterity I'll note that there's a tricky little regex engine sensitivity issue in this one, and if you were using Perl-flavored regex, you would need to do this:

\b(\S+)\b(?=.*\b\1\b.*)

instead of Jeff's answer, because C#'s regex will effectively capture \b in \1 but PCRE will not.

Regular expressions would be a poor choice of "tools" to solve this problem. Perhaps the following could work:

HashSet<string> corpus = new HashSet<string>();
char[] split = new char[] { ' ', '\t', '\r', '\n', '.', ';', ',', ':', ... };

foreach (string line in inputLines)
{
    string[] parts = line.Split(split, StringSplitOptions.RemoveEmptyEntries);
    foreach (string part in parts)
    {
        corpus.Add(part.ToUpperInvariant());
    }
}

// 'corpus' now contains all of the unique tokens

EDIT: This is me making a big assumption that you're "lexing" for some sort of analysis like searching.

Have a look at backreferences:
http://msdn.microsoft.com/en-us/library/thwdfzxy(VS.71).aspx

This a regex that will find doubled words. But it will match only one word per match. So you have to use it more than once.

new Regex( @"(.*)\b(\w+)\b(.*)(\2)(.*)", RegexOptions.IgnoreCase );

Of course this is not the best solution (see other answers, which propose not to use a regex at all). But you asked for a regex - here is one. Maybe just the idea helps you ...

Regular expression to find and remove duplicate words

Related

Recent Posts