Split string into sentences

Solution 1:

Parsing sentences is far from a trivial task, even for languages written in the Latin script, such as English. A naive approach like the one you outline in your question will fail often enough to be useless in practice.

A better approach is to use a BreakIterator configured with the right Locale.

import java.text.BreakIterator;
import java.util.Locale;

// Sentence-aware iterator for US English
BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
String source = "This is a test. This is a T.L.A. test. Now with a Dr. in it.";
iterator.setText(source);
int start = iterator.first();
// Each call to next() returns the end offset of the next sentence
for (int end = iterator.next();
    end != BreakIterator.DONE;
    start = end, end = iterator.next()) {
  System.out.println(source.substring(start, end));
}

Yields the following result:

  1. This is a test.
  2. This is a T.L.A. test.
  3. Now with a Dr. in it.
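If you need the sentences as a collection rather than printed output, the same loop can be wrapped in a small helper. This is only a sketch: the class name SentenceSplitter and the method splitSentences are made up for illustration, and trimming the trailing whitespace that BreakIterator leaves on each sentence is my own choice, not something the API does for you.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {

    // Hypothetical helper: collects the sentences reported by BreakIterator,
    // trimmed of surrounding whitespace, for the given locale.
    public static List<String> splitSentences(String text, Locale locale) {
        List<String> sentences = new ArrayList<>();
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);
        int start = iterator.first();
        for (int end = iterator.next();
             end != BreakIterator.DONE;
             start = end, end = iterator.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String source = "This is a test. This is a T.L.A. test. Now with a Dr. in it.";
        for (String sentence : splitSentences(source, Locale.US)) {
            System.out.println(sentence);
        }
    }
}
```

Passing the Locale as a parameter keeps the helper usable for text in other languages, since the break rules depend on it.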

Solution 2:

It will be difficult to get a regular expression to work in all cases, but to fix your immediate problem you can use a lookbehind:

String sResult = "This is a test. This is a T.L.A. test.";
String[] sSentence = sResult.split("(?<=[a-z])\\.\\s+");

Result:

This is a test
This is a T.L.A. test.

Note that there are abbreviations that do not end with capital letters, such as abbrev., Mr., etc. There are also sentences that don't end in periods!
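To make the first caveat concrete, here is a minimal sketch that runs the same pattern on a sentence containing "Mr.". The class name LookbehindSplitDemo and the sample text are only for illustration. Because "Mr." ends in a lowercase letter, the lookbehind matches there too, so the input is cut into three pieces: "My friend, Mr", "Jones, has a new dog" and "This is a test."

```java
public class LookbehindSplitDemo {

    // Same pattern as above: a period preceded by a lowercase letter
    // and followed by one or more whitespace characters.
    static final String PATTERN = "(?<=[a-z])\\.\\s+";

    public static String[] split(String text) {
        return text.split(PATTERN);
    }

    public static void main(String[] args) {
        String text = "My friend, Mr. Jones, has a new dog. This is a test.";
        // "Mr." is treated as a sentence end because 'r' is lowercase
        for (String piece : split(text)) {
            System.out.println(piece);
        }
    }
}
```

The split also swallows the period of every sentence except the last, as in the result shown above.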

Solution 3:

If you can, use a natural language processing tool, such as LingPipe. There are many subtleties which will be very hard to catch using regular expressions, e.g., (e.g. :-)), Mr., abbreviations, ellipsis (...), et cetera.

There is a very easy-to-follow tutorial on Sentence Detection on the LingPipe website.

Solution 4:

A late response, but for future visitors like me who have spent a long time searching: use an OpenNLP model. It was the best option in my case, and it worked with all the text samples here, including the crucial one mentioned by @nbz in the comments:

My friend, Mr. Jones, has a new dog. This is a test. This is a T.L.A. test. Now with a Dr. in it.

The detected sentences, one per line:

My friend, Mr. Jones, has a new dog.
This is a test.
This is a T.L.A. test.
Now with a Dr. in it.

You need to import the OpenNLP .jar libraries into your project, as well as the trained model en-sent.bin.

This tutorial will get you up and running quickly:

https://www.tutorialkart.com/opennlp/sentence-detection-example-in-opennlp/

And one for setting up the project in Eclipse:

https://www.tutorialkart.com/opennlp/how-to-setup-opennlp-java-project/

This is what the code looks like:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
 
import opennlp.tools.util.InvalidFormatException;
 
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
 
/**
* Sentence Detection Example in OpenNLP using Java
* @author tutorialkart
*/
public class SentenceDetectExample {
 
    public static void main(String[] args) {
        try {
            new SentenceDetectExample().sentenceDetect();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
 
    /**
     * This method is used to detect sentences in a paragraph/string
     * @throws InvalidFormatException
     * @throws IOException
     */
    public void sentenceDetect() throws InvalidFormatException, IOException {
        String paragraph = "This is a statement. This is another statement. Now is an abstract word for time, that is always flying.";
 
        // load the model file "en-sent.bin", available at http://opennlp.sourceforge.net/models-1.5/
        // try-with-resources closes the stream even if loading the model fails
        try (InputStream is = new FileInputStream("en-sent.bin")) {
            SentenceModel model = new SentenceModel(is);

            // feed the model to SentenceDetectorME class
            SentenceDetectorME sdetector = new SentenceDetectorME(model);

            // detect sentences in the paragraph
            String[] sentences = sdetector.sentDetect(paragraph);

            // print the detected sentences to the console
            for (String sentence : sentences) {
                System.out.println(sentence);
            }
        }
    }
}

Since the libraries and the model are bundled with your project, it works offline too, which is a big plus. As the answer by @Julien Silland says, sentence splitting is not a straightforward process, and having a trained model do it for you is the best option.