Total zero count across all columns in a pyspark dataframe
I need to find the percentage of zeros across all columns in a PySpark dataframe. How can I find the count of zeros in each column of the dataframe?
P.S.: I tried converting the dataframe to a pandas dataframe and using value_counts, but interpreting its output is not feasible for a large dataset.
Solution 1:
"How to find the count of zero across each columns in the dataframe?"
First:
import pyspark.sql.functions as F

# For each column, count only the rows where the value equals zero
df_zero = df.select([F.count(F.when(df[c] == 0, c)).alias(c) for c in df.columns])
Second, you can then inspect the counts (compared to .show(), this gives you a cleaner view, and the speed is not much different):
df_zero.limit(2).toPandas().head()
Enjoy! :)