PySpark first and last function over a partition in one go
When using orderBy with a Window you need to widen the frame boundaries to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING; otherwise the last function only sees the values between UNBOUNDED PRECEDING and CURRENT ROW, because RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is the default frame whenever an ordering is specified. With that default frame, last simply returns the current row's value.
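To see the pitfall concretely, here is a minimal sketch with toy data (the DataFrame and its values are made up purely for illustration):

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import last

spark = SparkSession.builder.getOrCreate()
df_toy = spark.createDataFrame([(1, 10), (1, 20), (1, 30)], ["id", "c1"])

# Default frame with an ordering: RANGE BETWEEN UNBOUNDED PRECEDING AND
# CURRENT ROW, so last("c1") only ever sees rows up to the current one.
w_default = Window.partitionBy("id").orderBy("c1")
df_toy.withColumn("last_c1", last("c1").over(w_default)).show()
# last_c1 comes out as 10, 20, 30: the current row's value each time,
# not the partition-wide last value.

# Explicit full frame: last("c1") is 30 on every row.
w_full = (Window.partitionBy("id").orderBy("c1")
          .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df_toy.withColumn("last_c1", last("c1").over(w_full)).show()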
Try this:
from pyspark.sql import Window
from pyspark.sql.functions import first, last

# Full-partition frame: both first and last see every row in the partition.
w = (Window.partitionBy('id', 'a1', 'a2').orderBy('c1')
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df = (df.withColumn("First_c1", first("c1").over(w))
        .withColumn("First_c3", first("c3").over(w))
        .withColumn("Last_c2", last("c2").over(w)))
# The window columns are constant within each group, so first() just picks that value.
(df.groupby("id", "a1", "a2")
   .agg(first("First_c1").alias("c1"),
        first("Last_c2").alias("c2"),
        first("First_c3").alias("c3"))
   .show())
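As an aside, since the window is ordered by c1, the same result can be computed in a single aggregation with no window at all. This sketch assumes Spark 3.3+, where min_by and max_by are available in the Python API (with ties in c1, both this and the window version are nondeterministic):

from pyspark.sql.functions import min, min_by, max_by

# first("c1") ordered by c1 is min("c1"); last("c2") is the c2 from the row
# with the largest c1; first("c3") is the c3 from the row with the smallest c1.
df.groupby("id", "a1", "a2").agg(
    min("c1").alias("c1"),
    max_by("c2", "c1").alias("c2"),
    min_by("c3", "c1").alias("c3"),
).show()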