NLTK Named Entity Recognition with Custom Data

Solution 1:

Are you committed to using NLTK/Python? I ran into the same problems as you and had much better results with Stanford's named-entity recognizer: http://nlp.stanford.edu/software/CRF-NER.shtml. The process for training the classifier on your own data is well documented in the FAQ.

If you really need to use NLTK, I'd hit up the mailing list for some advice from other users: http://groups.google.com/group/nltk-users.

Hope this helps!

Solution 2:

You can easily use Stanford NER along with NLTK. The Python script looks like this:

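import os
from nltk.tag.stanford import StanfordNERTagger  # named NERTagger in older NLTK releases

# Point NLTK at the Java executable (adjust the path for your system).
java_path = "/Java/jdk1.8.0_45/bin/java.exe"
os.environ['JAVAHOME'] = java_path

# Arguments: path to the trained model, then path to the Stanford NER jar.
st = StanfordNERTagger('../ner-model.ser.gz', '../stanford-ner.jar')
text = "Barack Obama was born in Hawaii"  # sample input; replace with your own text
tagging = st.tag(text.split())  # list of (token, tag) pairs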
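
The tagger returns one (token, tag) pair per token, with O marking tokens that are not part of any entity. Here is a minimal sketch of collapsing that output into whole entities, assuming the model emits plain labels such as PERSON rather than B-/I- prefixed ones. The helper name group_entities and the sample list are made up for illustration and are not part of NLTK:

```python
from itertools import groupby

def group_entities(tagged_tokens, outside_tag="O"):
    """Merge consecutive tokens that share a non-O tag into one entity."""
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != outside_tag:
            entities.append((" ".join(token for token, _ in group), tag))
    return entities

# Hand-written sample in the shape st.tag() returns, not real tagger output.
tagging = [("Barack", "PERSON"), ("Obama", "PERSON"), ("was", "O"),
           ("born", "O"), ("in", "O"), ("Hawaii", "LOCATION")]
print(group_entities(tagging))
```

One caveat with this approach: two different entities of the same type that sit directly next to each other get merged into one, because plain labels carry no boundary information.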

To train on your own data and create a model, refer to the first question in the Stanford NER FAQ.

The link is http://nlp.stanford.edu/software/crf-faq.shtml
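
As a rough sketch of what that FAQ entry asks for: the training file has one token per line followed by a tab and its label (O for non-entities), and a properties file tells the trainer where the data is and which features to use. The function names, file names, and labels below are hypothetical, the blank line between sentences is my assumption, and the feature flags are only a small subset of the FAQ's example, so check them against the FAQ before relying on them:

```python
def format_training_tsv(sentences):
    """Render each sentence as token<TAB>label lines, blank line between sentences."""
    blocks = ["\n".join(f"{token}\t{label}" for token, label in sentence)
              for sentence in sentences]
    return "\n\n".join(blocks) + "\n"

def format_properties(train_file, model_file):
    """Build a minimal properties file for the Stanford CRF trainer."""
    lines = [
        f"trainFile = {train_file}",
        f"serializeTo = {model_file}",
        "map = word=0,answer=1",  # column 0 is the token, column 1 the label
        "useClassFeature = true",
        "useWord = true",
        "usePrev = true",
        "useNext = true",
    ]
    return "\n".join(lines) + "\n"

# Made-up sample data; PERS and LOC are illustrative label names.
sentences = [[("Emma", "PERS"), ("lived", "O"), ("in", "O"), ("Highbury", "LOC")]]
print(format_training_tsv(sentences))
print(format_properties("train.tsv", "ner-model.ser.gz"))
```

After saving the two strings as train.tsv and train.prop, training should be something like `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop train.prop`, which writes the model to the serializeTo path.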

Solution 3:

I also had this issue, but I managed to work it out. You can use your own training data. I documented the main requirements/steps for this in my GitHub repository.

I used NLTK-trainer, so basically you have to get the training data into the right format (token NNP B-tag) and run the training script. Check my repository for more info.
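
To illustrate that format, here is a small sketch that turns POS-tagged tokens plus entity spans into one "token POS IOB-tag" line per token. Only the B- prefix appears in the description above; the I- continuation tags, the O tag for non-entities, and the space separator are assumptions based on the usual CoNLL chunking convention. The function name to_iob_lines and the sample data are hypothetical:

```python
def to_iob_lines(tokens, spans):
    """Format (word, pos) tokens as 'word POS IOB-tag' lines.

    spans is a list of (start, end, label) with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens of the entity
    return [f"{word} {pos} {tag}" for (word, pos), tag in zip(tokens, tags)]

# Made-up example sentence with two entities.
tokens = [("John", "NNP"), ("Smith", "NNP"), ("visited", "VBD"), ("Paris", "NNP")]
spans = [(0, 2, "PER"), (3, 4, "LOC")]
print("\n".join(to_iob_lines(tokens, spans)))
```

This prints one line per token, starting with "John NNP B-PER". Sentences are normally separated by a blank line in the training file.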