How to use WordNet 3.1 with NLTK in Python?

Solution 1:

After a lot of searching and trial and error, I was able to use Wordnet 3.1 on NLTK (Python). I tweaked this gist to make it work. I am providing the details below.

I divided the code provided in the gist into three parts.

Part 1. download_extract.py

import os

nltkdata_wn = '/path/to/nltk_data/corpora/wordnet/'
wn31 = "http://wordnetcode.princeton.edu/wn3.1.dict.tar.gz"

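# Back up the existing WordNet 3.0 files to a sibling folder, wordnet_3.0.
# The move runs only once, so re-running the script cannot overwrite the backup.
backup = nltkdata_wn.rstrip('/')+'_3.0'
if not os.path.exists(backup):
    os.mkdir(backup)
    os.system('mv '+nltkdata_wn+"* "+backup+"/")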

if not os.path.exists('wn3.1.dict.tar.gz'):
    os.system('wget '+wn31)

os.system("tar zxf wn3.1.dict.tar.gz -C "+nltkdata_wn)
os.system("mv "+nltkdata_wn+"dict/* "+nltkdata_wn)
os.rmdir(nltkdata_wn + 'dict')

This script backs up the existing WordNet 3.0 folder from wordnet to wordnet_3.0, downloads the WordNet 3.1 database, and puts it in the wordnet folder. Since I am on a Windows system, I did these steps manually.
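For systems without mv, wget and tar (such as Windows), the same steps can be sketched with the standard library alone. This is only a sketch, not part of the original gist: the function names backup_wordnet, install_archive and install_wn31 are hypothetical, and it assumes the archive unpacks into a single top-level dict folder, as the script above implies.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path

# Download URL taken from the script above.
WN31_URL = "http://wordnetcode.princeton.edu/wn3.1.dict.tar.gz"


def backup_wordnet(wn_dir: Path) -> Path:
    """Move the current wordnet folder aside as wordnet_3.0 and recreate it empty."""
    backup = wn_dir.with_name(wn_dir.name + "_3.0")
    if not backup.exists():
        shutil.move(str(wn_dir), str(backup))
        wn_dir.mkdir()
    return backup


def install_archive(archive: Path, wn_dir: Path) -> None:
    """Extract the archive and flatten its top-level dict/ folder into wn_dir."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(wn_dir)
    extracted = wn_dir / "dict"
    for item in extracted.iterdir():
        shutil.move(str(item), str(wn_dir / item.name))
    extracted.rmdir()


def install_wn31(wn_dir: Path, archive: Path = Path("wn3.1.dict.tar.gz")) -> None:
    """Download the archive if it is missing, back up 3.0, then install 3.1."""
    if not archive.exists():
        urllib.request.urlretrieve(WN31_URL, str(archive))
    backup_wordnet(wn_dir)
    install_archive(archive, wn_dir)
```

Calling install_wn31(Path('/path/to/nltk_data/corpora/wordnet')) would then run all three steps. The backup is skipped when wordnet_3.0 already exists, so a second run does not overwrite it.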

Part 2. create_lexnames.py

import os

nltkdata_wn = '/path/to/nltk_data/corpora/wordnet/'
dbfiles = nltkdata_wn+'dbfiles'

with open(nltkdata_wn+'lexnames', 'w') as fout:
    for i,j in enumerate(sorted(os.listdir(dbfiles))):
        pos = j.partition('.')[0]
        if pos == "noun":
            syncat = 1
        elif pos == "verb":
            syncat = 2
        elif pos == "adj":
            syncat = 3
        elif pos == "adv":
            syncat = 4
        elif j == "cntlist":
            syncat = "cntlist"
        fout.write("\t".join([str(i).zfill(2),j,str(syncat)])+"\n")

This creates the required lexnames file in the wordnet folder.
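To make the row format of the generated file easier to see, here is a minimal sketch of the same logic as a pure function that returns the rows instead of writing them. The names POS_TO_SYNCAT and lexname_rows are hypothetical, and falling back to the file name for entries without a known part-of-speech prefix is my assumption, generalising what the script does for cntlist.

```python
# Part-of-speech prefix -> syntactic category number, as in the script above.
POS_TO_SYNCAT = {"noun": 1, "verb": 2, "adj": 3, "adv": 4}


def lexname_rows(filenames):
    """Return one tab-separated row per file: zero-padded index, name, category."""
    rows = []
    for i, name in enumerate(sorted(filenames)):
        pos = name.partition(".")[0]
        # Assumption: names without a known prefix (e.g. "cntlist") keep
        # their own name in the third column.
        syncat = POS_TO_SYNCAT.get(pos, name)
        rows.append("\t".join([str(i).zfill(2), name, str(syncat)]))
    return rows


print(lexname_rows(["noun.act", "adj.all", "cntlist"]))
```

Each row has three tab-separated fields, which is why the check in Part 3 can unpack line.split() into exactly three values.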

Part 3. testing_wn31.py

from nltk.corpus import wordnet as wn

nltkdata_wn = '/path/to/nltk_data/corpora/wordnet/'

# Check that the index column of the generated lexnames file matches the line order.
with open(nltkdata_wn + 'lexnames', 'r') as fin:
    for i, line in enumerate(fin):
        index, lexname, _ = line.split()
        assert int(index) == i

# Testing wordnet function.
print(wn.synsets('dog'))
for i in wn.all_synsets():
    print(i, i.pos(), i.definition())

This checks the generated lexnames file and also tests whether the WordNet functions work.

Once I was done with this procedure, I ran the following code in Python and confirmed that it is actually running version 3.1:

>>> from nltk.corpus import wordnet
>>> wordnet.get_version()
'3.1'

A Word of Caution

Once you have replaced the WordNet database with version 3.1, you'll notice that if you run the following code

>>> import nltk
>>> nltk.download()

in the download dialog box, under the Corpora tab, WordNet will be shown as out of date. Do not try to update it, as that will either revert WordNet to version 3.0 or break it.