Programmatically install NLTK corpora / models, i.e. without the GUI downloader?
Solution 1:
The NLTK site does list a command line interface for downloading packages and collections at the bottom of this page:
http://www.nltk.org/data
The command line usage varies by which version of Python you are using, but on my Python 2.6 install I noticed I was missing the 'spanish_grammars' model, and this worked fine:
python -m nltk.downloader spanish_grammars
You mention listing the project's corpus and model requirements; while I'm not sure of a way to do that automatically, I figured I would at least share this.
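That said, if you want something requirements-like, here is a minimal sketch: keep one package identifier per line in a plain-text file (the name nltk-requirements.txt is just an illustrative convention, not something NLTK itself reads) and pass the whole list to download(), which accepts a list as well as a single identifier (see Solution 3):

import nltk

# Hypothetical convention: one NLTK package identifier per line in a
# plain-text file kept under version control.
with open("nltk-requirements.txt") as f:
    packages = [line.strip() for line in f if line.strip()]

# download() accepts a list of identifiers as well as a single string.
nltk.download(packages)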
Solution 2:
To install all NLTK corpora & models:
python -m nltk.downloader all
Alternatively, on Linux, you can use:
sudo python -m nltk.downloader -d /usr/local/share/nltk_data all
Replace all with popular if you just want the most popular corpora & models.
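For reference, the equivalent from inside Python (Solution 3 covers the download() function in more detail):

import nltk

# Collection identifiers work the same way as package identifiers.
nltk.download('popular')  # or 'all' for everything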
You may also browse the corpora & models through the command line:
mlee@server:/scratch/jjylee/tests$ sudo python -m nltk.downloader
[sudo] password for jjylee:
NLTK Downloader
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> d
Download which package (l=list; x=cancel)?
Identifier> l
Packages:
[ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
[ ] basque_grammars..... Grammars for Basque
[ ] bllip_wsj_no_aux.... BLLIP Parser: WSJ Model
[ ] book_grammars....... Grammars from NLTK Book
[ ] cess_esp............ CESS-ESP Treebank
[ ] chat80.............. Chat-80 Data Files
[ ] city_database....... City Database
[ ] cmudict............. The Carnegie Mellon Pronouncing Dictionary (0.6)
[ ] comparative_sentences Comparative Sentence Dataset
[ ] comtrans............ ComTrans Corpus Sample
[ ] conll2000........... CONLL 2000 Chunking Corpus
[ ] conll2002........... CONLL 2002 Named Entity Recognition Corpus
[ ] conll2007........... Dependency Treebanks from CoNLL 2007 (Catalan
and Basque Subset)
[ ] crubadan............ Crubadan Corpus
[ ] dependency_treebank. Dependency Parsed Treebank
[ ] europarl_raw........ Sample European Parliament Proceedings Parallel
Corpus
[ ] floresta............ Portuguese Treebank
[ ] framenet_v15........ FrameNet 1.5
Hit Enter to continue:
[ ] framenet_v17........ FrameNet 1.7
[ ] gazetteers.......... Gazeteer Lists
[ ] genesis............. Genesis Corpus
[ ] gutenberg........... Project Gutenberg Selections
[ ] hmm_treebank_pos_tagger Treebank Part of Speech Tagger (HMM)
[ ] ieer................ NIST IE-ER DATA SAMPLE
[ ] inaugural........... C-Span Inaugural Address Corpus
[ ] indian.............. Indian Language POS-Tagged Corpus
[ ] jeita............... JEITA Public Morphologically Tagged Corpus (in
ChaSen format)
[ ] kimmo............... PC-KIMMO Data Files
[ ] knbc................ KNB Corpus (Annotated blog corpus)
[ ] large_grammars...... Large context-free and feature-based grammars
for parser comparison
[ ] lin_thesaurus....... Lin's Dependency Thesaurus
[ ] mac_morpho.......... MAC-MORPHO: Brazilian Portuguese news text with
part-of-speech tags
[ ] machado............. Machado de Assis -- Obra Completa
[ ] masc_tagged......... MASC Tagged Corpus
[ ] maxent_ne_chunker... ACE Named Entity Chunker (Maximum entropy)
[ ] moses_sample........ Moses Sample Models
Hit Enter to continue: x
Download which package (l=list; x=cancel)?
Identifier> conll2002
Downloading package conll2002 to
/afs/mit.edu/u/m/mlee/nltk_data...
Unzipping corpora/conll2002.zip.
---------------------------------------------------------------------------
d) Download l) List u) Update c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader>
Solution 3:
In addition to the command line option already mentioned, you can programmatically install NLTK data in your Python script by passing an argument to the download() function.
See the help(nltk.download) text, specifically:
Individual packages can be downloaded by calling the ``download()``
function with a single argument, giving the package identifier for the
package that should be downloaded:

    >>> download('treebank') # doctest: +SKIP
    [nltk_data] Downloading package 'treebank'...
    [nltk_data] Unzipping corpora/treebank.zip.
I can confirm that this works for downloading one package at a time, or when passed a list or tuple.
>>> import nltk
>>> nltk.download('wordnet')
[nltk_data] Downloading package 'wordnet' to
[nltk_data] C:\Users\_my-username_\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\wordnet.zip.
True
You can also safely re-run the download for an already-installed package:
>>> nltk.download('wordnet')
[nltk_data] Downloading package 'wordnet' to
[nltk_data] C:\Users\_my-username_\AppData\Roaming\nltk_data...
[nltk_data] Package wordnet is already up-to-date!
True
Also, it appears the function returns a boolean value that you can use to see whether the download succeeded:
>>> nltk.download('not-a-real-name')
[nltk_data] Error loading not-a-real-name: Package 'not-a-real-name'
[nltk_data] not found in index
False
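Since download() reports success this way, a setup script can fail fast when a required package can't be fetched. A minimal sketch (the package list here is just an example):

import sys
import nltk

# download() returns True on success (or when the package is already
# up to date) and False on failure, so we can abort setup early.
for package in ("punkt", "wordnet"):
    if not nltk.download(package):
        sys.exit("Failed to download NLTK package: " + package)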
Solution 4:
I've managed to install the corpora and models inside a custom directory using the following code:
import nltk

# Download the "popular" collection into a custom directory...
nltk.download(info_or_id="popular", download_dir="/path/to/dir")
# ...and tell NLTK to search that directory when loading data.
nltk.data.path.append("/path/to/dir")
This will install the "popular" corpora/models inside /path/to/dir and let NLTK know where to look for them (via data.path.append).
You can't "freeze" the data in a requirements file, but you could add this code to your __init__, alongside some code that checks whether the files are already there.
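A minimal sketch of such a check, assuming the same custom directory as above: probe for one known resource with nltk.data.find(), which raises LookupError when the resource is missing, and download only on failure.

import nltk

NLTK_DATA_DIR = "/path/to/dir"  # same custom directory as above
nltk.data.path.append(NLTK_DATA_DIR)

def ensure_nltk_data():
    # Probe for one resource from the "popular" collection; find()
    # raises LookupError if it isn't available on any search path.
    try:
        nltk.data.find("corpora/wordnet")
    except LookupError:
        nltk.download(info_or_id="popular", download_dir=NLTK_DATA_DIR)

ensure_nltk_data()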