Spark DataFrame to pandas-profiling
I am trying to profile my data with the pandas-profiling library, fetching the data directly from Hive. This is the error I am receiving:
Py4JJavaError: An error occurred while calling o114.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 14.0 failed 4 times, most recent failure: Lost task 2.3 in stage 14.0 (TID 65, bdgtr026x30h4.nam.nsroot.net, executor 11): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 15823824. To avoid this, increase spark.kryoserializer.buffer.max value.
I tried setting the Spark configuration from my Jupyter notebook in Python, but I am still receiving the same error:
spark.conf.set("spark.kryoserializer.buffer.max", "512")
spark.conf.set('spark.kryoserializer.buffer.max.mb', 'val')
Based on my code, am I missing any steps?
from pandas_profiling import ProfileReport

df = spark.sql('SELECT id, acct FROM tablename').cache()
report = ProfileReport(df.toPandas())
Instead of setting the configuration in Jupyter, set it while creating the SparkSession: once the session has been created, its configuration can no longer be changed.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.kryoserializer.buffer.max", "512m") \
    .config("spark.kryoserializer.buffer", "512k") \
    .getOrCreate()
You can find the details of these properties in the Spark configuration documentation.
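As a side note, the stack trace itself tells you how large the buffer needed to be (required: 15823824 bytes). A minimal sketch for turning that number into a setting (the "next power of two, doubled" headroom heuristic is my own, not something Spark prescribes):

```python
import math

# Bytes reported by the Kryo "Buffer overflow" error above.
required_bytes = 15823824

# Convert to MiB, since spark.kryoserializer.buffer.max takes values like "64m".
required_mib = required_bytes / (1024 * 1024)  # ~15.1 MiB

# Round up to the next power of two, then double it for headroom.
suggested_mib = 2 ** math.ceil(math.log2(required_mib)) * 2

print(f"required: {required_mib:.1f} MiB -> suggested buffer.max: {suggested_mib}m")
# prints: required: 15.1 MiB -> suggested buffer.max: 32m
```

Any value up to the 2048m limit is accepted, so the 512m used above simply leaves generous headroom over the ~16 MiB this particular task needed.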