Shipping Python modules in pyspark to other nodes
How can I ship C compiled modules (for example, python-Levenshtein) to each node in a Spark cluster?
I know that I can ship Python files in Spark using a standalone Python script (example code below):
from pyspark import SparkContext
sc = SparkContext("local", "App Name", pyFiles=['MyFile.py', 'MyOtherFile.py'])
But in situations where there is no '.py', how do I ship the module?
Solution 1:
If you can package your module into a .egg
or .zip
file, you should be able to list it in pyFiles
when constructing your SparkContext (or you can add it later through sc.addPyFile).
For Python libraries that use setuptools, you can run python setup.py bdist_egg
to build an egg distribution.
Another option is to install the library cluster-wide, either by using pip/easy_install on each machine or by sharing a Python installation over a cluster-wide filesystem (like NFS).
Solution 2:
There are two main options here:
-
If it's a single file or a
.zip
/.egg
, pass it toSparkContext.addPyFile
. -
Insert
pip install
into a bootstrap code for the cluster's machines.- Some cloud platforms (DataBricks in this case) have UI to make this easier.
People also suggest using python shell
to test if the module is present on the cluster.