What is a regular expression for parsing out individual sentences?

Solution 1:

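Try this pattern, @"(\S.+?[.!?])(?=\s+|$)". It lazily matches from the first non-whitespace character up to a ".", "!" or "?" that is followed by whitespace or the end of the input: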

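using System;
using System.Text.RegularExpressions;

string str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";

// (\S.+?[.!?]) captures one sentence; (?=\s+|$) only checks what follows it.
Regex rx = new Regex(@"(\S.+?[.!?])(?=\s+|$)");
foreach (Match match in rx.Matches(str)) {
    // Each match is one sentence, including its closing punctuation.
    Console.WriteLine(match.Value);
}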

Results:

Hello world!
How are you?
I am fine.
This is a difficult sentence because I use I.D.
Newlines should also be accepted.
Numbers should not cause sentence breaks, like 1.23.

For more complicated text you will, of course, need a real NLP toolkit such as SharpNLP or NLTK. Mine is just a quick-and-dirty solution.

Here is SharpNLP's description and feature list:

SharpNLP is a collection of natural language processing tools written in C#. Currently it provides the following NLP tools:

  • a sentence splitter
  • a tokenizer
  • a part-of-speech tagger
  • a chunker (used to "find non-recursive syntactic annotations such as noun phrase chunks")
  • a parser
  • a name finder
  • a coreference tool
  • an interface to the WordNet lexical database

Solution 2:

var str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";

Regex.Split(str, @"(?<=[.?!])\s+").Dump();

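I tested this in LINQPad. Dump() is LINQPad's output helper, so outside LINQPad you need to print the resulting array yourself (for example with Console.WriteLine).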
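For readers without LINQPad, here is a minimal sketch of the same split as a console program. It assumes C# 9 or later (top-level statements); the shortened input and the variable names are illustrative, not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

string text = "Hello world! How are you? I am fine.";

// (?<=[.?!]) is a lookbehind: it requires a '.', '?' or '!' just before the
// split point without consuming it, so the punctuation stays with its sentence.
// \s+ is the whitespace run that the split actually removes.
string[] sentences = Regex.Split(text, @"(?<=[.?!])\s+");

foreach (string sentence in sentences)
{
    Console.WriteLine(sentence);
}
// Prints:
// Hello world!
// How are you?
// I am fine.
```

Because the terminator is matched by a lookbehind rather than consumed, no extra trimming or re-appending of punctuation is needed.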

Solution 3:

It is impossible to use regexes to parse natural language. What is the end of a sentence? A period can occur in many places (e.g. e.g.). You should use a natural language parsing toolkit such as OpenNLP or NLTK. Unfortunately there are very few, if any, offerings in C#. You may therefore have to create a web service or otherwise bridge to the toolkit from C#.

Note that relying on exact whitespace, as in "I.D.", will cause problems in the future. You'll soon find examples that break your regex. For example, most people put spaces after their initials.
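To make the spaced-initials problem concrete, here is a small sketch that runs the pattern from Solution 1 on a made-up two-sentence input. The sample text and variable names are my own illustration, and the program assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// Two sentences, with spaced initials in the first one.
string sample = "My initials are J. R. Smith. Call me.";

var rx = new Regex(@"(\S.+?[.!?])(?=\s+|$)");
MatchCollection pieces = rx.Matches(sample);

Console.WriteLine(pieces.Count);
foreach (Match piece in pieces)
{
    Console.WriteLine("[" + piece.Value + "]");
}
// Prints 3 pieces for 2 sentences:
// 3
// [My initials are J.]
// [R. Smith.]
// [Call me.]
```

The period after "J" is followed by whitespace, so the pattern cannot tell it apart from a real sentence end.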

There is an excellent summary of open-source and commercial offerings on Wikipedia (http://en.wikipedia.org/wiki/Natural_language_processing_toolkits). We have used several of them. It's worth the effort.

[You use the word "train". This is normally associated with machine learning (which is one approach to NLP and has been used for sentence splitting). Indeed, the toolkits I have mentioned include machine learning. I suspect that wasn't what you meant, and that you would rather evolve your expression through heuristics. Don't!]

Solution 4:

This is not really possible with regular expressions alone, unless you know exactly which "difficult" tokens you have, such as "i.d.", "Mr.", etc. For example, how many sentences is "Please show your I.D, Mr. Bond."? I'm not familiar with any C# implementations, but I've used NLTK's Punkt tokenizer. It probably would not be too hard to re-implement.
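If you really do know your "difficult" tokens in advance, one possible workaround is to split on the regex and then re-join any piece that ends in a known abbreviation. The sketch below is only that: the abbreviation list and the SplitSentences helper are hypothetical, this is not how Punkt works, and it assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical, hand-maintained list of tokens that end in a period
// without ending a sentence; extend it for your own text.
var abbreviations = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    "Mr.", "Mrs.", "Dr.", "e.g.", "i.e."
};

List<string> SplitSentences(string text)
{
    var sentences = new List<string>();
    string pending = "";
    foreach (string piece in Regex.Split(text.Trim(), @"(?<=[.?!])\s+"))
    {
        if (piece.Length == 0) continue; // empty input yields one empty piece
        // Re-join with a single space (the original whitespace is not preserved).
        pending = pending.Length == 0 ? piece : pending + " " + piece;
        // Last whitespace-delimited token of the text collected so far.
        string lastToken = Regex.Match(pending, @"\S+$").Value;
        if (!abbreviations.Contains(lastToken))
        {
            sentences.Add(pending);
            pending = "";
        }
    }
    if (pending.Length > 0) sentences.Add(pending); // text ended on an abbreviation
    return sentences;
}

foreach (string s in SplitSentences("Please show your I.D, Mr. Bond. Thank you."))
{
    Console.WriteLine(s);
}
// Prints:
// Please show your I.D, Mr. Bond.
// Thank you.
```

This still fails whenever a listed abbreviation genuinely ends a sentence (for example "I showed my I.D." if "I.D." were on the list), which is the ambiguity a trained tokenizer is meant to handle.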