How to pass a -D parameter or environment variable to a Spark job?
Change your spark-submit command line, adding three options:
--files <location_to_your_app.conf>
--conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app'
--conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app'
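For reference, here is a minimal sketch of how the application code could pick those options up, assuming you are using the Typesafe Config library; some.setting is a hypothetical key used only for illustration:

import com.typesafe.config.ConfigFactory

// ConfigFactory.load() honours the config.resource / config.file system
// properties set through extraJavaOptions, on the driver and the executors alike.
val config = ConfigFactory.load()
val someSetting = config.getString("some.setting") // hypothetical key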
Here is my Spark program, run with the additional Java options:
/home/spark/spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
--files /home/spark/jobs/fact_stats_ad.conf \
--conf spark.executor.extraJavaOptions=-Dconfig.fuction.conf \
--conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.MostAvailableFirstPolicy -Dconfig.file=/home/spark/jobs/fact_stats_ad.conf' \
--class jobs.DiskDailyJob \
--packages com.databricks:spark-csv_2.10:1.4.0 \
--jars /home/spark/jobs/alluxio-core-client-1.2.0-RC2-jar-with-dependencies.jar \
--driver-memory 2g \
/home/spark/jobs/convert_to_parquet.jar \
AD_COOKIE_REPORT FACT_AD_STATS_DAILY | tee /data/fact_ad_stats_daily.log
As you can see:
- the custom config file: --files /home/spark/jobs/fact_stats_ad.conf
- the executor Java options: --conf spark.executor.extraJavaOptions=-Dconfig.fuction.conf
- the driver Java options: --conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.MostAvailableFirstPolicy -Dconfig.file=/home/spark/jobs/fact_stats_ad.conf'
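As a quick sanity check (this snippet is my own sketch, not part of the job above), you can print the system properties from the driver to confirm the -D options made it into the JVM:

// Run on the driver: these should reflect spark.driver.extraJavaOptions.
println(sys.props.get("alluxio.user.file.writetype.default")) // Some(CACHE_THROUGH)
println(sys.props.get("config.file")) // Some(/home/spark/jobs/fact_stats_ad.conf)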
Hope it helps.
I had a lot of problems with passing -D parameters to the Spark executors and the driver. Here is a quote from my blog post about it:
"
The right way to pass the parameter is through the property:
“spark.driver.extraJavaOptions
” and “spark.executor.extraJavaOptions
”:
I passed both the log4j configuration property and the parameter that I needed for the configuration. (To the driver I was able to pass only the log4j configuration.)
For example (this was written in the properties file passed to spark-submit with --properties-file):
spark.driver.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/log4j.properties
spark.executor.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/log4j.properties -Dapplication.properties.file=hdfs:///some/path/on/hdfs/app.properties
spark.application.properties.file hdfs:///some/path/on/hdfs/app.properties
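For illustration, a small sketch of how code running inside an executor could read that property back; how the file is then fetched from HDFS is left to the application and is not taken from the blog post:

// Inside a task on an executor: spark.executor.extraJavaOptions become
// plain JVM system properties there.
val propsPath: Option[String] = sys.props.get("application.properties.file")
// e.g. Some(hdfs:///some/path/on/hdfs/app.properties)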
You can read my blog post about the overall configuration of Spark. I am running on YARN as well.
--files <location_to_your_app.conf>
--conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app'
--conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app'
If you write it this way, the later --conf will overwrite the previous one; you can verify this by looking at the Spark UI under the Environment tab after the job has started.
So the correct way is to put the options on the same line, like this:
--conf 'spark.executor.extraJavaOptions=-Da=b -Dc=d'
If you do this, you will find that all your settings are shown in the Spark UI.
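You can also verify the merged value programmatically; a sketch assuming spark is an existing SparkSession and the -Da=b -Dc=d options from the example above:

val sc = spark.sparkContext
// The merged string Spark passes to every executor JVM:
println(sc.getConf.get("spark.executor.extraJavaOptions")) // -Da=b -Dc=d
// Check on the executor side by reading the system property inside a task:
println(sc.parallelize(Seq(1)).map(_ => sys.props.get("a")).first()) // Some(b)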
I am starting my Spark application via a spark-submit command launched from within another Scala application. So I have an Array like
Array(".../spark-submit", ..., "--conf", confValues, ...)
where confValues is:
- for yarn-cluster mode: "spark.driver.extraJavaOptions=-Drun.mode=production -Dapp.param=..."
- for local[*] mode: "run.mode=development"
It is a bit tricky to understand where (not) to escape quotes and spaces, though. You can check the Spark web interface for system property values.
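For completeness, here is a sketch of how that launch could look with scala.sys.process; the paths are placeholders, and passing the command as a Seq means each element arrives as one argument, so the --conf value needs no extra shell quoting:

import scala.sys.process._

val cmd = Seq(
  "/path/to/spark-submit", // placeholder path
  "--master", "yarn",
  "--deploy-mode", "cluster",
  "--conf", "spark.driver.extraJavaOptions=-Drun.mode=production -Dapp.param=...",
  "/path/to/app.jar" // placeholder path
)
val exitCode = cmd.! // runs spark-submit and returns its exit code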