Select a column value with at least two records matching a condition (PySpark)

Solution 1:

Use conditional sum aggregation. `F.when(cond, 1)` without an `otherwise` clause returns 1 for matching rows and null for the rest, and `sum` ignores nulls, so the aggregate ends up counting the rows that satisfy the condition in each group:

import pyspark.sql.functions as F

df = spark.createDataFrame([
    ("s1", "2016-01-01", 20.5), ("s2", "2016-01-01", 30.1), ("s1", "2016-01-02", 60.2),
    ("s2", "2016-01-02", 20.4), ("s1", "2016-01-03", 55.5), ("s2", "2016-01-03", 52.5)
], ["sensorId", "date", "PM10"])

# count readings with PM10 > 50 per sensor, then keep sensors with more than one
df1 = df.groupBy("sensorId").agg(
    F.sum(F.when(F.col("PM10") > 50.0, 1)).alias("count")
).filter("count > 1")

df1.show()
#+--------+-----+
#|sensorId|count|
#+--------+-----+
#|      s1|    2|
#+--------+-----+
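
Since `F.count` also skips nulls, the same result can be obtained by counting the conditional column directly; a minimal equivalent sketch of the aggregation above (the name `df2` is just for illustration):

# F.when yields True for matching rows and null otherwise;
# F.count counts only the non-null values per group
df2 = df.groupBy("sensorId").agg(
    F.count(F.when(F.col("PM10") > 50.0, True)).alias("count")
).filter("count > 1")

Both versions produce the same output; the only difference is whether the nulls are skipped by `sum` or by `count`.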