How to check if a Spark DataFrame is empty?

Right now, I have to use df.count > 0 to check if the DataFrame is empty or not, but that is kind of inefficient. Is there any better way to do that?

PS: I want to check if it's empty so that I only save the DataFrame if it's not empty.
Solution 1:
For Spark 2.1.0, my suggestion would be to use head(n: Int) or take(n: Int) with isEmpty, whichever one has the clearest intent to you.
df.head(1).isEmpty
df.take(1).isEmpty
with the Python equivalent:
len(df.head(1)) == 0 # or bool(df.head(1))
len(df.take(1)) == 0 # or bool(df.take(1))
Using df.first() and df.head() will both throw a java.util.NoSuchElementException if the DataFrame is empty. first() calls head() directly, which calls head(1).head.
def first(): T = head()
def head(): T = head(1).head
head(1) returns an Array, so taking head on that Array causes the java.util.NoSuchElementException when the DataFrame is empty.
def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)
So instead of calling head(), use head(1) directly to get the array and then you can use isEmpty.
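A minimal sketch of the difference, assuming a SparkSession named spark is in scope:

val emptyDf = spark.emptyDataFrame

emptyDf.head(1).isEmpty   // true: returns an empty Array, no exception
// emptyDf.head()         // would throw java.util.NoSuchElementException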
take(n) is also equivalent to head(n)...
def take(n: Int): Array[T] = head(n)
And limit(1).collect() is equivalent to head(1) (notice limit(n).queryExecution in the head(n: Int) method), so the following are all equivalent, at least from what I can tell, and you won't have to catch a java.util.NoSuchElementException when the DataFrame is empty.
df.head(1).isEmpty
df.take(1).isEmpty
df.limit(1).collect().isEmpty
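For the questioner's scenario (only saving when the DataFrame has rows), a guard could look like this minimal sketch; df and the output path are hypothetical placeholders:

if (df.head(1).isEmpty) {
  println("DataFrame is empty, skipping save")  // nothing to write
} else {
  df.write.parquet("/path/to/output")  // hypothetical output path
}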
I know this is an older question so hopefully it will help someone using a newer version of Spark.
Solution 2:
I would say to just grab the underlying RDD. In Scala:
df.rdd.isEmpty
in Python:
df.rdd.isEmpty()
That being said, all this does is call take(1).length, so it'll do the same thing as Rohan answered... just maybe slightly more explicit?
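As a quick sanity check, here is a minimal sketch (again assuming a SparkSession named spark is in scope) showing the RDD-based check agreeing with the head-based one from Solution 1:

val nonEmpty = spark.range(5).toDF()
nonEmpty.rdd.isEmpty()              // false: the DataFrame has rows
nonEmpty.head(1).isEmpty            // false: same answer via head(1)
spark.emptyDataFrame.rdd.isEmpty()  // true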
Solution 3:
I had the same question, and I tested 3 main solutions:

- (df != null) && (df.count > 0)
- df.head(1).isEmpty as @hulin003 suggests
- df.rdd.isEmpty() as @Justin Pihony suggests

Of course all 3 work; however, in terms of performance, here is what I found when executing these methods on the same DataFrame on my machine, in terms of execution time:

- (df != null) && (df.count > 0) takes ~9366ms
- df.head(1).isEmpty takes ~5607ms
- df.rdd.isEmpty() takes ~1921ms

Therefore I think the best solution is df.rdd.isEmpty(), as @Justin Pihony suggests.
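If you want to reproduce such a comparison yourself, here is a rough sketch of one way to time the three checks; df is a hypothetical DataFrame, and absolute numbers will vary with data size, caching, and cluster resources:

// Simple timing helper: runs the block once and prints the elapsed time in ms
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
  result
}

time("count > 0")       { (df != null) && (df.count > 0) }
time("head(1).isEmpty") { df.head(1).isEmpty }
time("rdd.isEmpty")     { df.rdd.isEmpty() }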
Solution 4:
Since Spark 2.4.0 there is Dataset.isEmpty.

Its implementation is:
def isEmpty: Boolean =
withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
plan.executeCollect().head.getLong(0) == 0
}
Note that DataFrame is no longer a class in Scala; it's just a type alias (probably changed with Spark 2.0):
type DataFrame = Dataset[Row]
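With Spark 2.4.0 or later, the guard from the original question can then use the built-in method directly; a minimal sketch with a hypothetical df and output path:

if (!df.isEmpty) {
  df.write.parquet("/path/to/output")  // hypothetical output path
}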
Solution 5:
You can take advantage of the head() (or first()) functions to see if the DataFrame has a single row. If so, it is not empty.
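Since head() and first() throw java.util.NoSuchElementException on an empty DataFrame (see Solution 1), one way to use them for this check is to wrap the call, for example with scala.util.Try; a minimal sketch with a hypothetical df:

import scala.util.Try

// isSuccess is true only if first() actually returned a row
val notEmpty = Try(df.first()).isSuccess
if (notEmpty) {
  // safe to save the DataFrame here
}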