NLTK Named Entity Recognition with Custom Data
Solution 1:
Are you committed to using NLTK/Python? I ran into the same problems as you and had much better results with Stanford's named-entity recognizer: http://nlp.stanford.edu/software/CRF-NER.shtml. The process for training the classifier on your own data is well documented in the FAQ.
If you really need to use NLTK, I'd hit up the mailing list for some advice from other users: http://groups.google.com/group/nltk-users.
Hope this helps!
Solution 2:
You can easily use the Stanford NER along with NLTK. The Python script looks like this:
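import os
from nltk.tag.stanford import NERTagger  # named StanfordNERTagger in later NLTK releases

# NLTK reads the JAVAHOME environment variable to locate the Java executable
java_path = "/Java/jdk1.8.0_45/bin/java.exe"
os.environ['JAVAHOME'] = java_path

# Arguments: path to the trained model, then path to the Stanford NER jar
st = NERTagger('../ner-model.ser.gz', '../stanford-ner.jar')
# text is the string you want to tag; tag() expects a list of tokens
tagging = st.tag(text.split())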
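The tagger returns a list of (token, tag) pairs, with O marking tokens outside any entity. If you want whole entities rather than single tokens, here is a minimal sketch that merges consecutive tokens sharing a tag. The helper name group_entities and the sample pairs are hypothetical; the sample is hand-written, not real tagger output.

```python
from itertools import groupby


def group_entities(tagged, outside="O"):
    """Merge runs of adjacent tokens that share the same non-outside tag."""
    entities = []
    # groupby yields one run per stretch of consecutive pairs with the same tag
    for tag, run in groupby(tagged, key=lambda pair: pair[1]):
        if tag != outside:
            entities.append((" ".join(tok for tok, _ in run), tag))
    return entities


# Hand-written sample in the (token, tag) shape that tag() returns
sample = [("Barack", "PERSON"), ("Obama", "PERSON"), ("visited", "O"),
          ("Paris", "LOCATION"), (".", "O")]
print(group_entities(sample))
```

Note that two distinct entities of the same type sitting right next to each other would be merged into one, since plain tags carry no begin/inside marker.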
To train a model on your own data, see the first question in the Stanford NER FAQ: http://nlp.stanford.edu/software/crf-faq.shtml
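As a rough sketch of what that FAQ entry asks for: the training file has one token per line with a tab-separated label, and a properties file points the trainer at it. The helper names, file names, and sample sentence below are hypothetical. The blank line between sentences is my assumption, and the FAQ's example properties file also sets a number of feature flags that are left out here, so copy those from the FAQ.

```python
def to_tsv(sentences):
    """One 'token<TAB>label' line per token; blank line between sentences (assumed)."""
    blocks = ["\n".join(f"{tok}\t{label}" for tok, label in sent)
              for sent in sentences]
    return "\n\n".join(blocks) + "\n"


def make_properties(train_file, model_file):
    """Core properties only; the feature flags from the FAQ are omitted."""
    return "\n".join([
        f"trainFile = {train_file}",
        f"serializeTo = {model_file}",
        "map = word=0,answer=1",  # column 0 is the token, column 1 the label
    ]) + "\n"


def training_command(jar_path, prop_file):
    """Command line for the trainer; built here but not executed."""
    return ["java", "-cp", jar_path,
            "edu.stanford.nlp.ie.crf.CRFClassifier", "-prop", prop_file]


# Hypothetical labelled sentence with made-up label names
sentences = [[("Emma", "PERS"), ("lived", "O"), ("in", "O"), ("Highbury", "LOC")]]
print(to_tsv(sentences), end="")
print(make_properties("train.tsv", "ner-model.ser.gz"), end="")
print(" ".join(training_command("stanford-ner.jar", "train.prop")))
```

Write the two strings to train.tsv and train.prop, run the printed command, and the resulting ner-model.ser.gz can be passed to the tagger in the script above.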
Solution 3:
I also had this issue, but I managed to work it out. You can use your own training data. I documented the main requirements and steps for this in my GitHub repository.
I used NLTK-trainer, so basically you have to get the training data into the right format (token NNP B-tag) and run the training script. Check my repository for more info.
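To illustrate the (token NNP B-tag) layout, here is a minimal sketch that turns tokens carrying a POS tag and an optional entity label into one line per token. It assumes the conventional IOB scheme (B- opens an entity, I- continues it, O is outside); the exact layout the training script expects is described in the repository, not here. The function name to_iob_lines and the sample sentence are hypothetical.

```python
def to_iob_lines(tokens):
    """tokens: list of (word, pos, entity_label_or_None) -> 'word pos tag' lines."""
    lines = []
    prev = None
    for word, pos, label in tokens:
        if label is None:
            tag = "O"             # outside any entity
        elif label == prev:
            tag = f"I-{label}"    # continues the entity from the previous token
        else:
            tag = f"B-{label}"    # first token of a new entity
        lines.append(f"{word} {pos} {tag}")
        prev = label
    return lines


# Hand-labelled sample sentence
sample = [("John", "NNP", "PER"), ("Smith", "NNP", "PER"),
          ("works", "VBZ", None), ("at", "IN", None), ("Acme", "NNP", "ORG")]
print("\n".join(to_iob_lines(sample)))
```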