How can I include a Python package with a Hadoop streaming job?
I am trying to include a Python package (NLTK) with a Hadoop streaming job, but am not sure how to do this without including every file manually via the CLI argument "-file".
Edit: One solution would be to install this package on all the slaves, but I don't have that option currently.
Just came across this gem of a solution: http://blog.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/
First, create a zip containing the desired libraries:
zip -r nltkandyaml.zip nltk yaml
mv nltkandyaml.zip /path/to/where/your/mapper/will/be/nltkandyaml.mod
Next, ship the renamed archive with the job via Hadoop streaming's "-file" argument (as part of your full streaming command):
hadoop ... -file nltkandyaml.mod
Finally, load the libraries from Python:
import zipimport
importer = zipimport.zipimporter('nltkandyaml.mod')
yaml = importer.load_module('yaml')
nltk = importer.load_module('nltk')
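To put that in context, here is a minimal sketch of what a streaming mapper using the zip-imported package could look like; the file name mapper.py, the word count, and the stemming step are my own illustrative assumptions, not part of the original recipe:
#!/usr/bin/env python
# mapper.py -- illustrative streaming mapper that loads nltk from the shipped archive
import sys
import zipimport

# the archive shipped via -file lands in the task's working directory
importer = zipimport.zipimporter('nltkandyaml.mod')
nltk = importer.load_module('nltk')

stemmer = nltk.PorterStemmer()  # PorterStemmer needs no corpus data

for line in sys.stdin:
    for word in line.split():
        # emit (stemmed word, 1) pairs for a simple word count
        print('%s\t%d' % (stemmer.stem(word), 1))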
Additionally, this page summarizes how to include a corpus: http://www.xcombinator.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/
Download and unzip the WordNet corpus:
cd wordnet
zip -r ../wordnet-flat.zip *
In Python (WordNetCorpusReader lives in nltk.corpus.reader of the package loaded above):
from nltk.corpus.reader import WordNetCorpusReader
wn = WordNetCorpusReader(nltk.data.find('lib/wordnet-flat.zip'))
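Once the reader is built, it can be queried like the regular WordNet corpus; a quick sanity check (the word 'dog' is just an example) could be:
# look up synsets through the reader built from the flat zip
for synset in wn.synsets('dog'):
    print(synset)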
I would zip up the package into a .tar.gz or a .zip and pass the entire tarball or archive in a -file option to your hadoop command. I've done this in the past with Perl but not Python.
That said, I would think this would still work for you if you use Python's zipimport (http://docs.python.org/library/zipimport.html), which allows you to import modules directly from a zip.
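For reference, a minimal sketch of that zipimport approach, where mypackage.zip and mymodule are placeholder names for your shipped archive and a module inside it:
import zipimport

# 'mypackage.zip' was shipped with -file and sits in the task's working directory
importer = zipimport.zipimporter('mypackage.zip')
mymodule = importer.load_module('mymodule')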