How to find PySpark DataFrame memory usage?

Solution 1:

Try Spark's SizeEstimator, using a _to_java_object_rdd() helper to convert the Python RDD into Java objects first:

from pyspark.serializers import PickleSerializer, AutoBatchedSerializer

# `df` is the DataFrame whose size you want to estimate;
# `sc` is the active SparkContext (e.g. sc = spark.sparkContext)

# Helper function to convert the Python-pickled RDD into Java objects
def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.

    It converts each Python object into a Java object via Pyrolite,
    whether or not the RDD is serialized in batches.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

# First, convert the DataFrame's underlying RDD to Java objects
JavaObj = _to_java_object_rdd(df.rdd)

# Now run Spark's SizeEstimator on it (the result is in bytes)
sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
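
Since SizeEstimator.estimate returns a size in bytes, it can help to wrap the two steps in a small helper that reports something more readable. This is a minimal sketch reusing _to_java_object_rdd from above; the helper name estimate_df_size_mb is just for illustration:

def estimate_df_size_mb(df):
    # Pull the SparkContext from the DataFrame's own RDD
    sc = df.rdd.context
    java_obj = _to_java_object_rdd(df.rdd)
    size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(java_obj)
    return size_bytes / (1024 ** 2)

print(f"Estimated size: {estimate_df_size_mb(df):.1f} MiB")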

Solution 2:

Here is a rough estimation approach. As far as I know, Spark doesn't have a straightforward way to get a DataFrame's memory usage, but a pandas DataFrame does. So what you can do is (a code sketch follows the list):

  1. Select a 1% sample: sample = df.sample(fraction=0.01)
  2. Convert it to pandas: pdf = sample.toPandas()
  3. Get the pandas DataFrame's memory usage with pdf.info() (or, for a single number, pdf.memory_usage(deep=True).sum())
  4. Multiply that value by 100; this should give a rough estimate of your whole Spark DataFrame's memory usage.
  5. Correct me if I am wrong :|
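
A minimal sketch of those steps, assuming df is your Spark DataFrame; it uses memory_usage(deep=True) rather than info() so the number can be scaled programmatically:

# Rough estimate from a 1% pandas sample
fraction = 0.01
sample_pdf = df.sample(fraction=fraction).toPandas()

# deep=True counts the actual size of object (string) columns
sample_bytes = sample_pdf.memory_usage(deep=True).sum()

# Scale the sample back up to approximate the full DataFrame
estimated_mb = (sample_bytes / fraction) / (1024 ** 2)
print(f"Rough estimate: {estimated_mb:.1f} MiB")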