How to find Spark RDD/DataFrame size?
If you are simply looking to count the number of rows in the RDD, do:
val distFile = sc.textFile(file) // read the file as an RDD[String], one element per line
println(distFile.count)          // number of rows (lines)
If you are interested in the number of bytes, you can use the SizeEstimator:
import org.apache.spark.util.SizeEstimator
println(SizeEstimator.estimate(distFile))
https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html
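Note that SizeEstimator.estimate walks the object graph of whatever you pass it, so calling it on an RDD reference measures the driver-side RDD object, not the data spread across the executors. If what you want is the in-memory footprint of the data itself, a minimal sketch (assuming the sc and distFile from above, and a Spark version exposing the developer API getRDDStorageInfo) is to cache the RDD, materialize it, and read the storage info:
import org.apache.spark.storage.StorageLevel

distFile.persist(StorageLevel.MEMORY_ONLY)
distFile.count() // force the RDD to be computed and cached
val cachedBytes = sc.getRDDStorageInfo // developer API: one RDDInfo entry per cached RDD
  .find(_.id == distFile.id)
  .map(_.memSize)
  .getOrElse(0L)
println(s"cached size in bytes: $cachedBytes")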
Yes, finally I got the solution. Include these imports (for .toDF() below you also need the SQL implicits in scope, which the spark-shell imports for you):
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
How to find the RDD Size:
def calcRDDSize(rdd: RDD[String]): Long = {
rdd.map(_.getBytes("UTF-8").length.toLong)
.reduce(_+_) //add the sizes together
}
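For example, calling it on a text file (the HDFS path here is hypothetical, just to show the usage); keep in mind this is the size of the UTF-8 encoded text, not the JVM in-memory size:
val lines = sc.textFile("hdfs:///tmp/some-input.txt") // hypothetical path
println(s"RDD size in bytes: ${calcRDDSize(lines)}")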
How to find the DataFrame size (this just converts the DataFrame to an RDD of strings and reuses the function above, so it measures the length of each row's string representation rather than the in-memory size):
val dataFrame = sc.textFile(args(1)).toDF() // args(1) can be replaced with any input path
val rddOfDataframe = dataFrame.rdd.map(_.toString()) // each Row rendered as a string like "[a,b,c]"
val size = calcRDDSize(rddOfDataframe)
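If you only need a rough figure for a DataFrame, another option is to ask Catalyst for its own size estimate. This is a sketch and the exact call differs slightly between Spark versions (stats is a no-arg method from Spark 2.3 onwards):
// Catalyst's estimate of the DataFrame size in bytes, taken from the optimized logical plan
val catalystBytes: BigInt = dataFrame.queryExecution.optimizedPlan.stats.sizeInBytes
println(s"Catalyst size estimate: $catalystBytes bytes")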