dataframe: how to groupBy/count then filter on count in Scala

Solution 1:

When you pass a string to the filter function, the string is interpreted as SQL. count is a SQL keyword, and using count as a column name confuses the parser. This is a small bug (you can file a JIRA ticket if you want to).

You can easily avoid this by using a column expression instead of a String:

df.groupBy("x").count()
  .filter($"count" >= 2)
  .show()
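For comparison, here is a minimal sketch showing the failing string filter next to the column-expression version, written with org.apache.spark.sql.functions.col so it does not rely on the $ implicits (the DataFrame df with a column x is assumed, as in the question):

import org.apache.spark.sql.functions.col

val counted = df.groupBy("x").count()

// counted.filter("count >= 2")      // may fail: the parser can read count as the SQL aggregate function (the bug described above)
counted.filter(col("count") >= 2)    // works: count is referenced as an ordinary column
  .show()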

Solution 2:

So, is that a behavior to expect, a bug?

Truth be told, I am not sure. It looks like the parser is interpreting count not as a column name but as a function, and expects following parentheses. It looks like a bug, or at least a serious limitation of the parser.

Is there a canonical way to work around it?

Some options have already been mentioned by Herman and mattinbits, so here is a more SQL-ish approach from me:

import org.apache.spark.sql.functions.count

df.groupBy("x").agg(count("*").alias("cnt")).where($"cnt"  > 2)

Solution 3:

I think a solution is to put count in backticks:

.filter("`count` >= 2")

http://mail-archives.us.apache.org/mod_mbox/spark-user/201507.mbox/%3C8E43A71610EAA94A9171F8AFCC44E351B48EDF@fmsmsx124.amr.corp.intel.com%3E
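In context, on the same grouped DataFrame as in Solution 1, the backtick-quoted filter would look like this (a minimal sketch):

df.groupBy("x").count()
  .filter("`count` >= 2")   // backticks quote the identifier, so count is read as a column, not the aggregate function
  .show()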