Pyspark dataframe column value dependent on value from another row
Solution 1:
You can use the first function with ignorenulls=True over a Window. But you need to identify groups of manufacturer rows in order to partition by that group.
As you didn't give any ID column, I'm using monotonically_increasing_id and a cumulative conditional sum to create a group column:
from pyspark.sql import functions as F
from pyspark.sql import Window

df1 = df.withColumn(
    "row_id",
    F.monotonically_increasing_id()
).withColumn(
    # cumulative conditional sum: each "Factory" row starts a new group
    "group",
    F.sum(F.when(F.col("manufacturer") == "Factory", 1)).over(Window.orderBy("row_id"))
).withColumn(
    # replace the placeholder with the first product_id seen in the group
    "product_id",
    F.when(
        F.col("product_id") == 0,
        F.first("product_id", ignorenulls=True).over(Window.partitionBy("group").orderBy("row_id"))
    ).otherwise(F.col("product_id"))
).drop("row_id", "group")
df1.show()
#+-------------+----------+
#| manufacturer|product_id|
#+-------------+----------+
#| Factory| AE222|
#|Sub-Factory-1| AE222|
#|Sub-Factory-2| AE222|
#| Factory| AE333|
#|Sub-Factory-1| AE333|
#|Sub-Factory-2| AE333|
#+-------------+----------+
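For reference, here is a minimal sketch of how the input DataFrame assumed above could be built. It assumes sub-factory rows carry a placeholder product_id of "0", which is what the == 0 check in the solution relies on; the exact values are hypothetical:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data matching the layout implied by the solution:
# "Factory" rows carry the real product_id, sub-factory rows carry "0"
df = spark.createDataFrame(
    [
        ("Factory", "AE222"),
        ("Sub-Factory-1", "0"),
        ("Sub-Factory-2", "0"),
        ("Factory", "AE333"),
        ("Sub-Factory-1", "0"),
        ("Sub-Factory-2", "0"),
    ],
    ["manufacturer", "product_id"],
)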
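If you want to see how the cumulative conditional sum builds the groups, you can keep the helper columns instead of dropping them. This is the same logic as in the solution, shown here only for inspection:
from pyspark.sql import functions as F
from pyspark.sql import Window

df.withColumn(
    "row_id", F.monotonically_increasing_id()
).withColumn(
    # sum() ignores the nulls produced by when() without otherwise(),
    # so the counter only increments on "Factory" rows
    "group",
    F.sum(F.when(F.col("manufacturer") == "Factory", 1)).over(Window.orderBy("row_id"))
).show()
# for the sample data above, group is 1 for the first three rows and 2 for the last three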