Convert multiple columns in pyspark dataframe into one dictionary
I have created a PySpark DataFrame like this one:
df = spark.createDataFrame([
    ('v', 3, 'a'),
    ('d', 2, 'b'),
    ('q', 9, 'c')],
    ["c1", "c2", "c3"]
)
df.show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
|  v|  3|  a|
|  d|  2|  b|
|  q|  9|  c|
+---+---+---+
I want to create a new column like this:
+--------------------------+
| c4 |
+--------------------------+
|{"c1":"v","c2":3,"c3":"a"}|
|{"c1":"d","c2":2,"c3":"b"}|
|{"c1":"q","c2":9,"c3":"c"}|
+--------------------------+
I want c4
to be of type MapType
, not StringType
. Also, I want to keep the types of the values as they are (keep 3, 2 and 9 as integers, not strings).
Solution 1:
Use struct
+ to_json
like this if you want JSON strings:
import pyspark.sql.functions as F

df1 = df.select(
    F.to_json(
        F.struct(*[F.col(c) for c in df.columns])
    ).alias("c4")
)
df1.show(truncate=False)
#+--------------------------+
#|c4 |
#+--------------------------+
#|{"c1":"v","c2":3,"c3":"a"}|
#|{"c1":"d","c2":2,"c3":"b"}|
#|{"c1":"q","c2":9,"c3":"c"}|
#+--------------------------+
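to_json produces one JSON string per row, and because JSON distinguishes numbers from strings, c2 stays a number inside that string. A minimal sketch with plain Python's json module shows the same serialization (the dict below just mirrors the first row and is purely illustrative):

```python
import json

# Plain dict standing in for the first row's struct.
row = {"c1": "v", "c2": 3, "c3": "a"}

# Compact separators match the style of Spark's to_json output.
s = json.dumps(row, separators=(",", ":"))
print(s)  # {"c1":"v","c2":3,"c3":"a"}

# Round-tripping shows the integer survives inside the JSON string.
assert json.loads(s)["c2"] == 3
```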
EDIT
If you want a MapType column, use the create_map
function:
from itertools import chain

df1 = df.select(
    F.create_map(
        *chain(*[[F.lit(c), F.col(c)] for c in df.columns])
    ).alias("c4")
)
df1.show(truncate=False)
#+---------------------------+
#|c4 |
#+---------------------------+
#|{c1 -> v, c2 -> 3, c3 -> a}|
#|{c1 -> d, c2 -> 2, c3 -> b}|
#|{c1 -> q, c2 -> 9, c3 -> c}|
#+---------------------------+
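The chain(*[...]) idiom interleaves each column name with its column expression, producing the alternating key1, value1, key2, value2, ... argument list that create_map expects. With plain strings standing in for the Column objects (the "col(...)" names here are illustrative), the flattening looks like this:

```python
from itertools import chain

columns = ["c1", "c2", "c3"]

# Each column contributes a [key, value] pair ...
pairs = [[c, f"col({c})"] for c in columns]

# ... and chain flattens the pairs into one alternating key/value sequence.
flat = list(chain(*pairs))
print(flat)
# ['c1', 'col(c1)', 'c2', 'col(c2)', 'c3', 'col(c3)']
```

One caveat: a Spark map has a single value type, so mixing the string columns c1/c3 with the integer column c2 makes Spark coerce the integers, giving a map<string,string> rather than keeping c2 as an integer. If preserving the original types matters, the struct/to_json approach above (or keeping the struct column itself) is the safer option.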