How to find PySpark DataFrame memory usage?
Solution 1:
Try to use the _to_java_object_rdd() function:
import py4j.protocol
from py4j.protocol import Py4JJavaError
from py4j.java_gateway import JavaObject
from py4j.java_collections import JavaArray, JavaList
from pyspark import RDD, SparkContext
from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
# the DataFrame whose size you want to estimate
df
# Helper function to convert Python objects to Java objects
def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.
    It converts each Python object into a Java object via Pyrolite,
    whether or not the RDD is serialized in batches.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)
# First you have to convert it to an RDD
JavaObj = _to_java_object_rdd(df.rdd)
# Now we can run the estimator
sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
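SizeEstimator.estimate returns a size in bytes. As a small follow-up sketch (the byte-to-megabyte conversion below is my own addition, not part of the original answer), you can turn the result into a more readable figure:
# Continuing from the snippet above: the estimator returns a byte count,
# so convert it to megabytes for readability.
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
size_mb = size_bytes / (1024.0 * 1024.0)
print("Estimated DataFrame size: %.2f MB" % size_mb)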
Solution 2:
I have something in mind, but it's just a rough estimate. As far as I know, Spark doesn't have a straightforward way to get a DataFrame's memory usage, but a pandas DataFrame does. So what you can do is:
- Select 1% of the data:
sample = df.sample(fraction=0.01)
pdf = sample.toPandas()
- Get the pandas DataFrame memory usage with:
pdf.info()
- Multiply that value by 100; this should give a rough estimate of your whole Spark DataFrame's memory usage (a short end-to-end sketch is shown below).
- Correct me if I am wrong :|
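For reference, here is a minimal sketch of that sampling approach in one snippet. It is only a rough estimate and assumes the sample fits in driver memory; using memory_usage(deep=True) instead of info() is my own choice so the byte count can be scaled programmatically.
# Rough estimate by extrapolating from a 1% sample
# (assumes df already exists and the sample fits on the driver).
fraction = 0.01
sample_pdf = df.sample(fraction=fraction).toPandas()

# deep=True counts the actual size of object (string) columns, not just pointers
sample_bytes = sample_pdf.memory_usage(deep=True).sum()

# Scale the sample's footprint back up to the full DataFrame
estimated_bytes = sample_bytes / fraction
print("Estimated size: %.2f MB" % (estimated_bytes / (1024.0 * 1024.0)))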