Summing common column values by pattern matching on column names in PySpark
Example dataframe:

| lens | INV.NO | INV.NO | INV.NO |
|---|---|---|---|
| First | 1 | 3 | 4 |
| Second | 2 | 5 | 3 |
| First | 3 | 7 | 2 |
| Second | 4 | 8 | 1 |
I want to pattern match on the column names (^INV.NO*), pick up all three matching columns, and create a new column plain_sum holding their row-wise sum, using PySpark. Expected output:
| lens | INV.NO | INV.NO | INV.NO | plain_sum |
|---|---|---|---|---|
| First | 1 | 3 | 4 | 8 |
| Second | 2 | 5 | 3 | 10 |
| First | 3 | 7 | 2 | 12 |
| Second | 4 | 8 | 1 | 13 |
I don't see 'prod' in your column names. You can use a simple list comprehension to collect your 'plain' columns into a list, then use `reduce` with `add`:
from functools import reduce
from operator import add
from pyspark.sql.functions import col

# collect every column whose name contains 'plain-'
col_list = [c for c in df.columns if 'plain-' in c]

# add the matched columns element-wise into a new plain_sum column
df = df.withColumn("plain_sum", reduce(add, [col(x) for x in col_list]))
df.show(truncate=False)
+------+-------------+-------------+-------------+---------+
|lens |plain-dev 103|plain-pre 104|plain-rte 107|plain_sum|
+------+-------------+-------------+-------------+---------+
|First |1 |3 |4 |8 |
|Second|2 |5 |3 |10 |
|First |3 |7 |2 |12 |
|Second|4 |8 |1 |13 |
+------+-------------+-------------+-------------+---------+
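
If your real column names do follow the INV.NO pattern from the question, a substring test won't express `^INV.NO*`; here is a minimal sketch using Python's `re` module instead, assuming the matched columns are uniquely named (note that the dot in the pattern must be escaped, and that column names containing a dot need backticks so Spark doesn't parse them as struct-field access):

import re
from functools import reduce
from operator import add
from pyspark.sql.functions import col

# Match column names starting with "INV.NO"; the dot is escaped so it
# is matched literally rather than as the regex "any character".
pattern = re.compile(r"^INV\.NO")
col_list = [c for c in df.columns if pattern.match(c)]

# Backticks keep Spark from treating the dot as nested-field access.
df = df.withColumn("plain_sum", reduce(add, [col(f"`{c}`") for c in col_list]))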
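
One caveat worth noting: Spark's `+` returns NULL if any operand is NULL, so if the matched columns can contain nulls, wrapping each one in `coalesce` keeps the sum well defined. A hedged variant of the same line:

from functools import reduce
from operator import add
from pyspark.sql.functions import coalesce, col, lit

# Treat NULL cells as 0 so a single missing value doesn't null out the sum
df = df.withColumn(
    "plain_sum",
    reduce(add, [coalesce(col(x), lit(0)) for x in col_list])
)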