AWS EMR - ModuleNotFoundError: No module named 'pyarrow'
Solution 1:
On EMR, Python 3 is not the default interpreter for PySpark; you have to make it explicit. One way to do that is to pass a config.json
file when you create the cluster. It's available in the Edit software settings
section of the AWS EMR UI. A sample JSON file looks something like this:
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  },
  {
    "Classification": "yarn-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]
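If you create the cluster from the AWS CLI instead of the UI, the same JSON can be passed with the --configurations flag. A minimal sketch, assuming the JSON above is saved as config.json; the cluster name, release label, instance type, and counts are placeholders:

# Sketch only: cluster name, release label, and instance settings are placeholders
aws emr create-cluster \
  --name "my-pyspark-cluster" \
  --release-label emr-5.28.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations file://config.json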
You also need the pyarrow
module installed on all core nodes, not only on the master. For that you can use a bootstrap script while creating the cluster in AWS. A sample bootstrap script can be as simple as this:
#!/bin/bash
sudo python3 -m pip install pyarrow==0.13.0
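EMR runs bootstrap actions from S3, so the script has to be uploaded first and then referenced when creating the cluster. A hedged sketch; the bucket name and key are placeholders:

# Sketch only: bucket and key are placeholders
aws s3 cp install_pyarrow.sh s3://your-bucket/bootstrap/install_pyarrow.sh
# Then add this flag to the create-cluster command shown earlier:
#   --bootstrap-actions Path=s3://your-bucket/bootstrap/install_pyarrow.sh,Name="Install pyarrow"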
Solution 2:
There are two options in your case:
One is to make sure the Python environment is correct on every machine: set PYSPARK_PYTHON to a Python interpreter that has the third-party module such as pyarrow installed. You can use type -a python to check how many Python interpreters there are on your worker nodes. If the interpreter path is the same on every node, you can set PYSPARK_PYTHON in spark-env.sh and then copy that file to every other node (see the sketch below). Read this for more: https://spark.apache.org/docs/2.4.0/spark-standalone.html
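For example, a minimal spark-env.sh entry could look like the lines below. The interpreter path is an assumption; use whatever type -a python reported on your nodes.

# In $SPARK_HOME/conf/spark-env.sh on every node (interpreter path is an assumption)
export PYSPARK_PYTHON=/usr/bin/python3
# Optionally pin the driver's interpreter as well
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3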
The other option is to add an argument to spark-submit: package your extra module into a zip or egg file first, then run spark-submit --py-files pyarrow.zip your_code.py (a packaging sketch follows below). This way Spark will ship your module to every other node automatically. https://spark.apache.org/docs/latest/submitting-applications.html
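As a rough sketch of the packaging step (directory and file names are just examples, not from the original answer; modules with compiled extensions don't always import cleanly from a zip, so verify on your cluster):

# Sketch only: install the module into a local folder, zip its contents, then submit
pip install pyarrow -t ./deps
cd deps && zip -r ../pyarrow.zip . && cd ..
spark-submit --py-files pyarrow.zip your_code.py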
I hope this helps.