PySpark: explode JSON in column to multiple columns
As long as you are using Spark version 2.1 or higher, pyspark.sql.functions.from_json should get you your desired result, but you would need to first define the required schema:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# Schema of the JSON string stored in the data column
schema = StructType(
    [
        StructField('key1', StringType(), True),
        StructField('key2', StringType(), True)
    ]
)

# Parse the JSON string into a struct, then flatten it with data.*
df.withColumn("data", from_json("data", schema))\
    .select(col('id'), col('point'), col('data.*'))\
    .show()
which should give you
+---+-----+----+----+
| id|point|key1|key2|
+---+-----+----+----+
|abc| 6| 124| 345|
|df1| 7| 777| 888|
|4bd| 6| 111| 788|
+---+-----+----+----+
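If you want to reproduce this end to end, a minimal sketch of an input DataFrame matching the question could look like this (the JSON strings are assumptions reconstructed from the output above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# data holds JSON strings; the values mirror the expected output above
df = spark.createDataFrame(
    [
        ('abc', 6, '{"key1": "124", "key2": "345"}'),
        ('df1', 7, '{"key1": "777", "key2": "888"}'),
        ('4bd', 6, '{"key1": "111", "key2": "788"}')
    ],
    ['id', 'point', 'data']
)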
As suggested by @pault, the data field is a string field. Since the keys are the same (i.e. 'key1', 'key2') in the JSON string across rows, you might also use json_tuple() (this function is new in version 1.6, based on the documentation):
from pyspark.sql import functions as F
df.select('id', 'point', F.json_tuple('data', 'key1', 'key2').alias('key1', 'key2')).show()
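This should give you the same table as above; the main advantage of json_tuple() here is that you do not have to define a schema up front.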
Below is my original post, which is most likely wrong if the original table was produced by df.show(truncate=False), since in that case the data field is a string, not a Python data structure.

Since you have exploded the data into rows, I assumed the column data is a Python data structure (for example a map or struct) instead of a string:
from pyspark.sql import functions as F
df.select('id', 'point', F.col('data').getItem('key1').alias('key1'), F.col('data')['key2'].alias('key2')).show()
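For reference, here is a minimal sketch of a case where this form does work, assuming data has already been parsed into a map column (the sample row is taken from the output above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# data is an actual map here, not a JSON string,
# so getItem()/bracket access works directly
df = spark.createDataFrame(
    [('abc', 6, {'key1': '124', 'key2': '345'})],
    ['id', 'point', 'data']
)
df.select('id', 'point',
          F.col('data').getItem('key1').alias('key1'),
          F.col('data')['key2'].alias('key2')).show()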
This works for my use case:
from pyspark.sql.functions import from_json

data1 = spark.read.parquet(path)

# Infer the JSON schema by reading the string column back through spark.read.json
json_schema = spark.read.json(data1.rdd.map(lambda row: row.json_col)).schema
data2 = data1.withColumn("data", from_json("json_col", json_schema))

# Keep every original column except the parsed struct itself,
# and add each struct field prefixed with "data." so select() flattens it
col1 = data2.columns
col1.remove("data")
col2 = data2.select("data.*").columns
append_str = "data."
col3 = [append_str + val for val in col2]
col_list = col1 + col3

data3 = data2.select(*col_list).drop("json_col")
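Note that inferring json_schema this way costs an extra pass over the data, but it saves you from hand-writing the schema when the JSON keys are not known in advance.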