Pyspark dataframe column value dependent on value from another row
Solution 1:
You can use the first function with ignorenulls=True over a Window. But you need to identify groups of manufacturer rows in order to partition by that group.
As you didn't give any ID column, I'm using monotonically_increasing_id and a cumulative conditional sum to create a group column:
from pyspark.sql import functions as F
from pyspark.sql import Window

df1 = df.withColumn(
    "row_id",
    F.monotonically_increasing_id()
).withColumn(
    # cumulative conditional sum: each "Factory" row starts a new group
    "group",
    F.sum(F.when(F.col("manufacturer") == "Factory", 1)).over(Window.orderBy("row_id"))
).withColumn(
    # replace the placeholder with the first product_id seen in the group
    "product_id",
    F.when(
        F.col("product_id") == 0,
        F.first("product_id", ignorenulls=True).over(Window.partitionBy("group").orderBy("row_id"))
    ).otherwise(F.col("product_id"))
).drop("row_id", "group")
df1.show()
#+-------------+----------+
#| manufacturer|product_id|
#+-------------+----------+
#| Factory| AE222|
#|Sub-Factory-1| AE222|
#|Sub-Factory-2| AE222|
#| Factory| AE333|
#|Sub-Factory-1| AE333|
#|Sub-Factory-2| AE333|
#+-------------+----------+
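For reference, here is a minimal sketch of how the input DataFrame assumed above could be built. It assumes sub-factory rows carry a placeholder product_id of "0", which is what the == 0 check in the solution relies on; the exact values are hypothetical:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data matching the layout implied by the solution:
# "Factory" rows carry the real product_id, sub-factory rows carry "0"
df = spark.createDataFrame(
    [
        ("Factory", "AE222"),
        ("Sub-Factory-1", "0"),
        ("Sub-Factory-2", "0"),
        ("Factory", "AE333"),
        ("Sub-Factory-1", "0"),
        ("Sub-Factory-2", "0"),
    ],
    ["manufacturer", "product_id"],
)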
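If you want to see how the cumulative conditional sum builds the groups, you can keep the helper columns instead of dropping them. This is the same logic as in the solution, shown here only for inspection:
from pyspark.sql import functions as F
from pyspark.sql import Window

df.withColumn(
    "row_id", F.monotonically_increasing_id()
).withColumn(
    # sum() ignores the nulls produced by when() without otherwise(),
    # so the counter only increments on "Factory" rows
    "group",
    F.sum(F.when(F.col("manufacturer") == "Factory", 1)).over(Window.orderBy("row_id"))
).show()
# for the sample data above, group is 1 for the first three rows and 2 for the last three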