PySpark first and last function over a partition in one go
When using orderBy with a Window you need to widen the frame boundaries to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING; otherwise the last function only sees the values between UNBOUNDED PRECEDING and CURRENT ROW, because RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is the default frame whenever an ordering is specified. With that default frame, last simply returns the current row's value.
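To see the pitfall concretely, here is a minimal sketch with toy data (the DataFrame and its values are made up purely for illustration):

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import last

spark = SparkSession.builder.getOrCreate()
df_toy = spark.createDataFrame([(1, 10), (1, 20), (1, 30)], ["id", "c1"])

# Default frame with an ordering: RANGE BETWEEN UNBOUNDED PRECEDING AND
# CURRENT ROW, so last("c1") only ever sees rows up to the current one.
w_default = Window.partitionBy("id").orderBy("c1")
df_toy.withColumn("last_c1", last("c1").over(w_default)).show()
# last_c1 comes out as 10, 20, 30: the current row's value each time,
# not the partition-wide last value.

# Explicit full frame: last("c1") is 30 on every row.
w_full = (Window.partitionBy("id").orderBy("c1")
          .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df_toy.withColumn("last_c1", last("c1").over(w_full)).show()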
Try this:
from pyspark.sql import Window
from pyspark.sql.functions import first, last

# Full-partition frame: both first and last see every row in the partition.
w = (Window.partitionBy('id', 'a1', 'a2').orderBy('c1')
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df = (df.withColumn("First_c1", first("c1").over(w))
        .withColumn("First_c3", first("c3").over(w))
        .withColumn("Last_c2", last("c2").over(w)))
# The window columns are constant within each group, so first() just picks that value.
(df.groupby("id", "a1", "a2")
   .agg(first("First_c1").alias("c1"),
        first("Last_c2").alias("c2"),
        first("First_c3").alias("c3"))
   .show())
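As an aside, since the window is ordered by c1, the same result can be computed in a single aggregation with no window at all. This sketch assumes Spark 3.3+, where min_by and max_by are available in the Python API (with ties in c1, both this and the window version are nondeterministic):

from pyspark.sql.functions import min, min_by, max_by

# first("c1") ordered by c1 is min("c1"); last("c2") is the c2 from the row
# with the largest c1; first("c3") is the c3 from the row with the smallest c1.
df.groupby("id", "a1", "a2").agg(
    min("c1").alias("c1"),
    max_by("c2", "c1").alias("c2"),
    min_by("c3", "c1").alias("c3"),
).show()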