How to Capitalize Locations in a List Python
What you're looking for is Named Entity Recognition (NER). NLTK does support a named entity function: ne_chunk
, which can be used for this purpose. I'll give a demonstration:
from nltk import word_tokenize, pos_tag, ne_chunk
sentence = "In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
# Tokenize str -> List[str]
tok_sent = word_tokenize(sentence)
# Tag List[str] -> List[Tuple[str, str]]
pos_sent = pos_tag(tok_sent)
print(pos_sent)
# Chunk this tagged data
tree_sent = ne_chunk(pos_sent)
# This returns a Tree, which we pretty-print
tree_sent.pprint()
locations = []
# All subtrees at height 2 will be our named entities
for named_entity in tree_sent.subtrees(lambda t: t.height() == 2):
# Extract named entity type and the chunk
ne_type = named_entity.label()
chunk = " ".join([tagged[0] for tagged in named_entity.leaves()])
print(ne_type, chunk)
if ne_type == "GPE":
locations.append(chunk)
print(locations)
This outputs (with my comments added):
# pos_tag output:
[('In', 'IN'), ('the', 'DT'), ('wake', 'NN'), ('of', 'IN'), ('a', 'DT'), ('string', 'NN'), ('of', 'IN'), ('abuses', 'NNS'), ('by', 'IN'), ('New', 'NNP'), ('York', 'NNP'), ('police', 'NN'), ('officers', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('1990s', 'CD'), (',', ','), ('Loretta', 'NNP'), ('E.', 'NNP'), ('Lynch', 'NNP'), (',', ','), ('the', 'DT'), ('top', 'JJ'), ('federal', 'JJ'), ('prosecutor', 'NN'), ('in', 'IN'), ('Brooklyn', 'NNP'), (',', ','), ('spoke', 'VBD'), ('forcefully', 'RB'), ('about', 'IN'), ('the', 'DT'), ('pain', 'NN'), ('of', 'IN'), ('a', 'DT'), ('broken', 'JJ'), ('trust', 'NN'), ('that', 'IN'), ('African-Americans', 'NNP'), ('felt', 'VBD'), ('and', 'CC'), ('said', 'VBD'), ('the', 'DT'), ('responsibility', 'NN'), ('for', 'IN'), ('repairing', 'VBG'), ('generations', 'NNS'), ('of', 'IN'), ('miscommunication', 'NN'), ('and', 'CC'), ('mistrust', 'NN'), ('fell', 'VBD'), ('to', 'TO'), ('law', 'NN'), ('enforcement', 'NN'), ('.', '.')]
# ne_chunk output:
(S
In/IN
the/DT
wake/NN
of/IN
a/DT
string/NN
of/IN
abuses/NNS
by/IN
(GPE New/NNP York/NNP)
police/NN
officers/NNS
in/IN
the/DT
1990s/CD
,/,
(PERSON Loretta/NNP E./NNP Lynch/NNP)
,/,
the/DT
top/JJ
federal/JJ
prosecutor/NN
in/IN
(GPE Brooklyn/NNP)
,/,
spoke/VBD
forcefully/RB
about/IN
the/DT
pain/NN
of/IN
a/DT
broken/JJ
trust/NN
that/IN
African-Americans/NNP
felt/VBD
and/CC
said/VBD
the/DT
responsibility/NN
for/IN
repairing/VBG
generations/NNS
of/IN
miscommunication/NN
and/CC
mistrust/NN
fell/VBD
to/TO
law/NN
enforcement/NN
./.)
# All entities found
GPE New York
PERSON Loretta E. Lynch
GPE Brooklyn
# All GPE (Geo-Political Entity)
['New York', 'Brooklyn']
However, it should be noted that the performance of this ne_chunk
seems to fall significantly if we remove all capitalisation from the sentence.
We can perform similar stuff with spaCy:
import spacy
import en_core_web_sm
from pprint import pprint
sentence = "In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
nlp = en_core_web_sm.load()
doc = nlp(sentence)
pprint([(X.text, X.label_) for X in doc.ents])
# Then, we can take only `GPE`:
print([X.text for X in doc.ents if X.label_ == "GPE"])
Which outputs:
[('New York', 'GPE'),
('the 1990s', 'DATE'),
('Loretta E. Lynch', 'PERSON'),
('Brooklyn', 'GPE'),
('African-Americans', 'NORP')]
['New York', 'Brooklyn']
This output (for GPE's) is identical to NLTK's, but the reason I mention spaCy is because unlike NLTK, it also works on fully lower-case sentences. If I lower-case my test sentence, then the output becomes:
[('new york', 'GPE'),
('the 1990s', 'DATE'),
('loretta e. lynch', 'PERSON'),
('brooklyn', 'GPE'),
('african-americans', 'NORP')]
['new york', 'brooklyn']
This allows you to title-case these words in an otherwise lower-case sentence.