How to tweak the NLTK sentence tokenizer
You need to supply a list of abbreviations to the tokenizer, like so:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
sentence_splitter = PunktSentenceTokenizer(punkt_param)
text = "is THAT what you mean, Mrs. Hussey?"
sentences = sentence_splitter.tokenize(text)
sentences is now:
['is THAT what you mean, Mrs. Hussey?']
Update: This does not work if the last word of the sentence has an apostrophe or a quotation mark attached to it (like Hussey?'). So a quick-and-dirty way around this is to put spaces in front of apostrophes and quotes that follow sentence-end symbols (.!?):
text = text.replace('?"', '? "').replace('!"', '! "').replace('."', '. "')
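If you do not want to enumerate every punctuation/quote combination by hand, the same quick-and-dirty fix can be written as a single regular expression. This is just a sketch of the workaround above (the helper name pad_closing_quotes is made up for illustration):

```python
import re

def pad_closing_quotes(text):
    """Insert a space between a sentence-end symbol (.!?) and a
    quote or apostrophe that immediately follows it."""
    return re.sub(r'([.!?])([\'"])', r'\1 \2', text)

print(pad_closing_quotes('is THAT what you mean, Mrs. Hussey?"'))
# is THAT what you mean, Mrs. Hussey? "
```

Note that a period followed by a space (as in Mrs. Hussey) is left untouched, since the second group only matches a quote character.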
You can modify NLTK's pre-trained English sentence tokenizer to recognize more abbreviations by adding them to the set _params.abbrev_types. For example:
import nltk

extra_abbreviations = ['dr', 'vs', 'mr', 'mrs', 'prof', 'inc', 'i.e']
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentence_tokenizer._params.abbrev_types.update(extra_abbreviations)
Note that the abbreviations must be specified without the final period, but do include any internal periods, as in 'i.e' above. For details about the other tokenizer parameters, refer to the relevant documentation.
You can tell the PunktSentenceTokenizer.tokenize method to include "terminal" double quotes with the rest of the sentence by setting the realign_boundaries parameter to True. See the code below for an example.
I do not know a clean way to prevent text like Mrs. Hussey from being split into two sentences. However, here is a hack which

- mangles all occurrences of Mrs. Hussey to Mrs._Hussey,
- then splits the text into sentences with sent_tokenize.tokenize,
- then, for each sentence, unmangles Mrs._Hussey back to Mrs. Hussey.
I wish I knew a better way, but this might work in a pinch.
import nltk
import re
import functools
mangle = functools.partial(re.sub, r'([MD]rs?[.]) ([A-Z])', r'\1_\2')
unmangle = functools.partial(re.sub, r'([MD]rs?[.])_([A-Z])', r'\1 \2')
sent_tokenize = nltk.data.load('tokenizers/punkt/english.pickle')
sample = '''"A clam for supper? a cold clam; is THAT what you mean, Mrs. Hussey?" says I, "but that\'s a rather cold and clammy reception in the winter time, ain\'t it, Mrs. Hussey?"'''
sample = mangle(sample)
sentences = [unmangle(sent) for sent in
             sent_tokenize.tokenize(sample, realign_boundaries=True)]
print("\n-----\n".join(sentences))
yields
"A clam for supper?
-----
a cold clam; is THAT what you mean, Mrs. Hussey?"
-----
says I, "but that's a rather cold and clammy reception in the winter time, ain't it, Mrs. Hussey?"