Spark 2.1.0 session config settings (PySpark)
Solution 1:
You aren't actually overwriting anything with this code; you can see this for yourself by trying the following.
As soon as you start the pyspark shell, type:
sc.getConf().getAll()
This will show you all of the current config settings. Then run your code and check again: nothing changes.
What you should do instead is create a new configuration and use that to create a SparkContext. Do it like this:
import pyspark

# Build the configuration you want, stop the running context, and start a new one with it
conf = pyspark.SparkConf().setAll([('spark.executor.memory', '8g'), ('spark.executor.cores', '3'), ('spark.cores.max', '3'), ('spark.driver.memory', '8g')])
sc.stop()
sc = pyspark.SparkContext(conf=conf)
Then you can check yourself just like above with:
sc.getConf().getAll()
This should reflect the configuration you wanted.
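If you only want to confirm a single value after restarting the context, reading one key back should be enough; with the settings above, this should return '8g':
sc.getConf().get('spark.executor.memory')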
Solution 2:
Update configuration in Spark 2.3.1
To change the default Spark configuration, you can follow these steps:
Import the required classes
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
Get the default configurations
spark.sparkContext._conf.getAll()
Update the default configurations
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'), ('spark.app.name', 'Spark Updated Conf'), ('spark.executor.cores', '4'), ('spark.cores.max', '4'), ('spark.driver.memory','4g')])
Stop the current Spark Session
spark.sparkContext.stop()
Create a Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
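To confirm the new session picked up the changes, you can read a value back from the freshly created context; with the settings above, this should return '4g':
spark.sparkContext.getConf().get('spark.executor.memory')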
Solution 3:
You could also set configuration properties when you start pyspark, just as with spark-submit:
pyspark --conf property=value
Here is one example:
-bash-4.2$ pyspark
Python 3.6.8 (default, Apr 25 2019, 21:02:35)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.2.0
      /_/
Using Python version 3.6.8 (default, Apr 25 2019 21:02:35)
SparkSession available as 'spark'.
>>> spark.conf.get('spark.eventLog.enabled')
'true'
>>> exit()
-bash-4.2$ pyspark --conf spark.eventLog.enabled=false
Python 3.6.8 (default, Apr 25 2019, 21:02:35)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.2.0
      /_/
Using Python version 3.6.8 (default, Apr 25 2019 21:02:35)
SparkSession available as 'spark'.
>>> spark.conf.get('spark.eventLog.enabled')
'false'
Solution 4:
Setting 'spark.driver.host' to 'localhost' in the config works for me:
spark = SparkSession \
    .builder \
    .appName("MyApp") \
    .config("spark.driver.host", "localhost") \
    .getOrCreate()
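If you want to double-check that the setting took effect, reading it back from the running context should return the value you set:
spark.sparkContext.getConf().get('spark.driver.host')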
Solution 5:
I had a different requirement: check whether executor and driver memory sizes were passed in as parameters and, if they were, update the config with only the executor and driver changes. Below are the steps:
- Import Libraries
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
- Define Spark and get the default configuration
spark = (SparkSession.builder
         .master("yarn")
         .appName("experiment")
         .config("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false")
         .getOrCreate())
conf = spark.sparkContext._conf.getAll()
- Check whether the executor and driver memory sizes were provided (the pseudocode below shows a single conditional check; you can add further cases as needed). If they were, apply the given values; otherwise keep the default configuration.
if executor_mem is not None and driver_mem is not None:
    # Override only the executor and driver memory, keeping everything else from the current conf
    conf = spark.sparkContext._conf.setAll([('spark.executor.memory', executor_mem), ('spark.driver.memory', driver_mem)])
    spark.sparkContext.stop()
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
else:
    # No overrides were passed in, so keep the default session
    pass
Don't forget to stop the existing Spark context; this makes sure the executor and driver memory sizes actually change to the values you passed in as parameters. Hope this helps!
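The snippet above assumes executor_mem and driver_mem already hold the requested sizes. As a rough sketch of how they might be populated, assuming the values arrive as command-line arguments (the argparse flags below are hypothetical, not part of the original answer):
import argparse

# Hypothetical flags; adapt to however your job actually receives its parameters
parser = argparse.ArgumentParser()
parser.add_argument('--executor-mem', default=None)  # e.g. '4g'
parser.add_argument('--driver-mem', default=None)    # e.g. '4g'
args = parser.parse_args()

executor_mem = args.executor_mem  # None when the flag is not supplied
driver_mem = args.driver_mem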