Select a column value that has at least two records matching a condition (PySpark)
Solution 1:
Use a conditional sum aggregation: count the matching rows per group, then keep only the groups where that count is greater than 1.
import pyspark.sql.functions as F

df = spark.createDataFrame([
    ("s1", "2016-01-01", 20.5), ("s2", "2016-01-01", 30.1), ("s1", "2016-01-02", 60.2),
    ("s2", "2016-01-02", 20.4), ("s1", "2016-01-03", 55.5), ("s2", "2016-01-03", 52.5)
], ["sensorId", "date", "PM10"])

# F.when emits 1 for rows where PM10 > 50 and null otherwise (no .otherwise
# branch); sum() ignores nulls, so the result is the per-group count of
# matching rows.
df1 = df.groupBy("sensorId").agg(
    F.sum(F.when(F.col("PM10") > 50.0, 1)).alias("count")
).filter("count > 1")
df1.show()
#+--------+-----+
#|sensorId|count|
#+--------+-----+
#| s1| 2|
#+--------+-----+
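An equivalent approach, shown as a minimal sketch against the same df: filter the matching rows first, then count per group. The names df2 and sensor_ids are just illustrative:

# Filter first, then count rows per sensor; GroupedData.count() adds a
# column named "count".
df2 = (
    df.filter(F.col("PM10") > 50.0)
      .groupBy("sensorId")
      .count()
      .filter(F.col("count") > 1)
)

# To pull the qualifying sensorId values back to the driver as a Python list:
sensor_ids = [row.sensorId for row in df2.select("sensorId").collect()]
print(sensor_ids)  # ['s1']

Both versions produce the same result; the conditional-sum form does it in a single aggregation pass, while the filter-then-count form may read more naturally.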