How do I download NLTK data?

Updated answer: NLTK works well with Python 2.7. I had Python 3.2, so I uninstalled 3.2 and installed 2.7. Now it works!

I have installed NLTK and tried to download the NLTK data. I followed the instructions on this page: http://www.nltk.org/data.html

I downloaded NLTK, installed it, and then tried to run the following code:

>>> import nltk
>>> nltk.download()

It gave me an error message like the one below:

Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    nltk.download()
AttributeError: 'module' object has no attribute 'download'

I tried both nltk.download() and nltk.downloader(); both gave me error messages.

Then I ran help(nltk) to inspect the package; it shows the following info:

NAME
    nltk

PACKAGE CONTENTS
    align
    app (package)
    book
    ccg (package)
    chat (package)
    chunk (package)
    classify (package)
    cluster (package)
    collocations
    corpus (package)
    data
    decorators
    downloader
    draw (package)
    examples (package)
    featstruct
    grammar
    help
    inference (package)
    internals
    lazyimport
    metrics (package)
    misc (package)
    model (package)
    parse (package)
    probability
    sem (package)
    sourcedstring
    stem (package)
    tag (package)
    test (package)
    text
    tokenize (package)
    toolbox
    tree
    treetransforms
    util
    yamltags

FILE
    c:\python32\lib\site-packages\nltk

I do see downloader there, but I'm not sure why it does not work. Python 3.2.2 on Windows Vista.


Solution 1:

TL;DR

To download a particular dataset/model, use the nltk.download() function, e.g. if you are looking to download the punkt sentence tokenizer, use:

$ python3
>>> import nltk
>>> nltk.download('punkt')
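
Once punkt is installed, the tokenizers that depend on it should work. A minimal sanity check (just a sketch; the sample sentence is arbitrary):

>>> import nltk
>>> nltk.download('punkt')    # no-op if the models are already present
>>> text = "Good muffins cost $3.88 in New York. Please buy me two of them."
>>> nltk.sent_tokenize(text)  # sentence tokenizer backed by punkt
>>> nltk.word_tokenize(text)  # word tokenizer, uses sent_tokenize (and punkt) internally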

If you're unsure which data/models you need, you can start with the basic list of data and models:

>>> import nltk
>>> nltk.download('popular')

This will download the list of "popular" resources, which includes:

<collection id="popular" name="Popular packages">
      <item ref="cmudict" />
      <item ref="gazetteers" />
      <item ref="genesis" />
      <item ref="gutenberg" />
      <item ref="inaugural" />
      <item ref="movie_reviews" />
      <item ref="names" />
      <item ref="shakespeare" />
      <item ref="stopwords" />
      <item ref="treebank" />
      <item ref="twitter_samples" />
      <item ref="omw" />
      <item ref="wordnet" />
      <item ref="wordnet_ic" />
      <item ref="words" />
      <item ref="maxent_ne_chunker" />
      <item ref="punkt" />
      <item ref="snowball_data" />
      <item ref="averaged_perceptron_tagger" />
    </collection>
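
If you want to check which of these resources are already on disk before (re)downloading, one option is to probe for each one with the public nltk.data.find(), which raises LookupError when a resource is missing (a sketch; the three resource names below are just examples):

>>> import nltk
>>> for res in ('corpora/stopwords', 'corpora/wordnet', 'tokenizers/punkt'):
...     try:
...         nltk.data.find(res)
...         print(res, 'is installed')
...     except LookupError:
...         print(res, 'is missing')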

EDITED

In case anyone wants to avoid the errors caused by downloading larger datasets from nltk, from https://stackoverflow.com/a/38135306/610569:

$ rm /Users/<your_username>/nltk_data/corpora/panlex_lite.zip
$ rm -r /Users/<your_username>/nltk_data/corpora/panlex_lite
$ python

>>> import nltk
>>> dler = nltk.downloader.Downloader()
>>> dler._update_index()
>>> dler._status_cache['panlex_lite'] = 'installed' # Trick the index into treating panlex_lite as if it were already installed.
>>> dler.download('popular')
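
If poking at the private _status_cache feels fragile, an alternative sketch is to download the popular packages one by one and simply skip anything you want to leave out; the list below just mirrors the <collection id="popular"> shown above:

>>> import nltk
>>> popular = ['cmudict', 'gazetteers', 'genesis', 'gutenberg', 'inaugural',
...            'movie_reviews', 'names', 'shakespeare', 'stopwords', 'treebank',
...            'twitter_samples', 'omw', 'wordnet', 'wordnet_ic', 'words',
...            'maxent_ne_chunker', 'punkt', 'snowball_data',
...            'averaged_perceptron_tagger']
>>> for pkg in popular:
...     nltk.download(pkg)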

Updated

From v3.2.5 onwards, NLTK has a more informative error message when an nltk_data resource is not found, e.g.:

>>> from nltk import word_tokenize
>>> word_tokenize('x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/l/alvas/git/nltk/nltk/tokenize/__init__.py", line 128, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/Users//alvas/git/nltk/nltk/tokenize/__init__.py", line 94, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/Users/alvas/git/nltk/nltk/data.py", line 820, in load
    opened_resource = _open(resource_url)
  File "/Users/alvas/git/nltk/nltk/data.py", line 938, in _open
    return find(path_, path + ['']).open()
  File "/Users/alvas/git/nltk/nltk/data.py", line 659, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  Searched in:
    - '/Users/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************

Related

  • To find nltk_data directory (auto-magically), see https://stackoverflow.com/a/36383314/610569

  • To download nltk_data to a different path, see https://stackoverflow.com/a/48634212/610569

  • To configure the nltk_data path (i.e. set a different path for NLTK to find nltk_data), see https://stackoverflow.com/a/22987374/610569; a combined sketch follows below.
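
Putting those three together, a minimal sketch (the directory below is only a placeholder; NLTK also honours the NLTK_DATA environment variable):

>>> import nltk
>>> print(nltk.data.path)          # directories NLTK searches for nltk_data
>>> nltk.download('punkt', download_dir='/some/other/nltk_data')
>>> nltk.data.path.append('/some/other/nltk_data')   # make NLTK search the new location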

Solution 2:

Try

nltk.download('all')

This will download all the data, so there is no need to download each dataset individually.
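
Note that 'all' downloads every available package and is considerably larger than the 'popular' collection. If you script it, nltk.download() also accepts keyword arguments such as quiet and halt_on_error; treat the exact set as a sketch and check help(nltk.download) for your version:

>>> import nltk
>>> nltk.download('all', quiet=True, halt_on_error=False)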

Solution 3:

Install pip: run in a terminal: sudo easy_install pip

Install NumPy (optional): run: sudo pip install -U numpy

Install NLTK: run: sudo pip install -U nltk

Test the installation: run python, then type import nltk

To download the corpora, run: python -m nltk.downloader all
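
To verify that the corpora actually landed where NLTK expects them, a quick sanity check (a sketch; brown is one of the corpora included in 'all'):

$ python
>>> import nltk
>>> nltk.__version__
>>> from nltk.corpus import brown
>>> brown.words()[:10]    # succeeds only if the corpus files were found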

Solution 4:

Do not name your file nltk.py. I used the same code but named my file nltk.py and got the same error as you; after I renamed the file, it worked fine.
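
A quick way to spot this kind of shadowing is to check where the imported module actually lives; if nltk.__file__ points at your own script instead of site-packages, rename the script (and delete any leftover nltk.pyc or __pycache__ entry next to it):

>>> import nltk
>>> print(nltk.__file__)   # should point into ...\site-packages\nltk\, not your own nltk.py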