Summing common column values by pattern matching on column names in PySpark
Example dataframe:

| lens | INV.NO | INV.NO | INV.NO |
|---|---|---|---|
| First | 1 | 3 | 4 |
| Second | 2 | 5 | 3 |
| First | 3 | 7 | 2 |
| Second | 4 | 8 | 1 |
I want to pattern match on the column names (^INV.NO*), pick up all three matching columns, and create a new column plain_sum holding their row-wise sum, using PySpark. Expected output:
| lens | INV.NO | INV.NO | INV.NO | plain_sum |
|---|---|---|---|---|
| First | 1 | 3 | 4 | 8 |
| Second | 2 | 5 | 3 | 10 |
| First | 3 | 7 | 2 | 12 |
| Second | 4 | 8 | 1 | 13 |
I don't see 'prod' in your column names. You can use a simple list comprehension to collect your 'plain' columns into a list, then use `reduce` with `add`:
from functools import reduce
from operator import add
from pyspark.sql.functions import col

# collect every column whose name contains 'plain-'
col_list = [c for c in df.columns if 'plain-' in c]

# add the matched columns element-wise into a new plain_sum column
df = df.withColumn("plain_sum", reduce(add, [col(x) for x in col_list]))
df.show(truncate=False)
+------+-------------+-------------+-------------+---------+
|lens |plain-dev 103|plain-pre 104|plain-rte 107|plain_sum|
+------+-------------+-------------+-------------+---------+
|First |1 |3 |4 |8 |
|Second|2 |5 |3 |10 |
|First |3 |7 |2 |12 |
|Second|4 |8 |1 |13 |
+------+-------------+-------------+-------------+---------+
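
If your real column names do follow the INV.NO pattern from the question, a substring test won't express `^INV.NO*`; here is a minimal sketch using Python's `re` module instead, assuming the matched columns are uniquely named (note that the dot in the pattern must be escaped, and that column names containing a dot need backticks so Spark doesn't parse them as struct-field access):

import re
from functools import reduce
from operator import add
from pyspark.sql.functions import col

# Match column names starting with "INV.NO"; the dot is escaped so it
# is matched literally rather than as the regex "any character".
pattern = re.compile(r"^INV\.NO")
col_list = [c for c in df.columns if pattern.match(c)]

# Backticks keep Spark from treating the dot as nested-field access.
df = df.withColumn("plain_sum", reduce(add, [col(f"`{c}`") for c in col_list]))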
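
One caveat worth noting: Spark's `+` returns NULL if any operand is NULL, so if the matched columns can contain nulls, wrapping each one in `coalesce` keeps the sum well defined. A hedged variant of the same line:

from functools import reduce
from operator import add
from pyspark.sql.functions import coalesce, col, lit

# Treat NULL cells as 0 so a single missing value doesn't null out the sum
df = df.withColumn(
    "plain_sum",
    reduce(add, [coalesce(col(x), lit(0)) for x in col_list])
)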