Window function does not act as expected when I use Order By (PySpark)

Solution 1:

The simple reason is that when the window has an ORDER BY clause, the default frame is Window.unboundedPreceding to Window.currentRow, which means the max is taken from the first row in the partition up to the current row, NOT up to the last row of the partition.

This is a common gotcha. (You can replace .max() with sum() and see what output you get; it also changes depending on how you order the partition.) The sketch below illustrates the behaviour.
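As a minimal illustration (using a small made-up DataFrame, not the asker's exact data), this is what the default frame does to max() once orderBy is present:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample rows, loosely based on the output shown further down.
df_demo = spark.createDataFrame(
    [("Tablet", 1500), ("Tablet", 4500), ("Tablet", 6500),
     ("Cell Phone", 3000), ("Cell Phone", 6000)],
    ["CATEGORY", "REVENUE"],
)

# orderBy with no explicit frame: the frame defaults to
# unboundedPreceding .. currentRow, so max() becomes a running max.
running_spec = Window.partitionBy("CATEGORY").orderBy("REVENUE")

df_demo.select(
    "CATEGORY",
    "REVENUE",
    F.max("REVENUE").over(running_spec).alias("running_max"),
).show()
# Each row shows the max of its own REVENUE and the rows ordered
# before it, not the partition-wide max.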

To solve this, you can specify that the max for each partition should always be calculated over the full window partition, like so:

from pyspark.sql import Window
from pyspark.sql import functions as F

window_spec = (
    Window.partitionBy(df['CATEGORY'])
    .orderBy(df['REVENUE'])
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)

revenue_difference = F.max(df['REVENUE']).over(window_spec)

df.select(
    df['CATEGORY'],
    df['REVENUE'],
    revenue_difference.alias("revenue_difference")
).show()
+----------+-------+------------------+
|  CATEGORY|REVENUE|revenue_difference|
+----------+-------+------------------+
|    Tablet|   6500|              6500|
|    Tablet|   5500|              6500|
|    Tablet|   4500|              6500|
|    Tablet|   3000|              6500|
|    Tablet|   2500|              6500|
|    Tablet|   1500|              6500|
|Cell Phone|   6000|              6000|
|Cell Phone|   6000|              6000|
|Cell Phone|   5000|              6000|
|Cell Phone|   3000|              6000|
|Cell Phone|   3000|              6000|
+----------+-------+------------------+
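Alternatively, if the ordering inside the window isn't actually needed for this calculation, you can simply drop orderBy: without an ORDER BY, the default frame already covers the entire partition. A short sketch of that variant (same df and column names as above):

from pyspark.sql import Window
from pyspark.sql import functions as F

# No orderBy: the default frame is the whole partition,
# so max() is the partition-wide max without needing rowsBetween.
unordered_spec = Window.partitionBy(df['CATEGORY'])

df.select(
    df['CATEGORY'],
    df['REVENUE'],
    F.max(df['REVENUE']).over(unordered_spec).alias("revenue_difference"),
).show()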