Flatten JSON Columns in Dataframe
I have data in a dataframe that resembles the sample below. I am trying to flatten the data so the values in the JSON blobs become columns, and so that a JSON blob containing multiple records produces a new row per record. I would like to leave non-JSON fields (i.e. ID) as they are.
I specifically followed: How to flatten a pandas dataframe with some columns as json? - but after running it I am left with the same dataframe and the JSON still unparsed. Below is the function from that post, which has worked for me before:
import pandas as pd

def flatten_nested_json_df(df):
    df = df.reset_index()
    print(f"original shape: {df.shape}")
    print(f"original columns: {df.columns}")

    # search for columns to explode/flatten
    s = (df.applymap(type) == list).all()
    list_columns = s[s].index.tolist()
    s = (df.applymap(type) == dict).all()
    dict_columns = s[s].index.tolist()
    print(f"lists: {list_columns}, dicts: {dict_columns}")

    while len(list_columns) > 0 or len(dict_columns) > 0:
        new_columns = []

        for col in dict_columns:
            print(f"flattening: {col}")
            # explode dictionaries horizontally, adding new columns
            horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
            horiz_exploded.index = df.index
            df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
            new_columns.extend(horiz_exploded.columns)  # inplace

        for col in list_columns:
            print(f"exploding: {col}")
            # explode lists vertically, adding new columns
            df = df.drop(columns=[col]).join(df[col].explode().to_frame())
            new_columns.append(col)

        # check if there are still dict or list fields to flatten
        s = (df[new_columns].applymap(type) == list).all()
        list_columns = s[s].index.tolist()
        s = (df[new_columns].applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()
        print(f"lists: {list_columns}, dicts: {dict_columns}")

    print(f"final shape: {df.shape}")
    print(f"final columns: {df.columns}")
    return df
Any help is appreciated!
ID | PROPERTIES | FORMSUBMISSIONS |
---|---|---|
123 | {"firstname":{"value":"FAKE"},"lastmodifieddate":{"value":"FAKE"},"lastname":{"value":"FAKE"}} | [{"contact-associated-by":["FAKE"],"conversion-id":"FAKE","form-id":"FAKE","form-type":"FAKE","meta-data":[],"portal-id": FAKE,"timestamp": FAKE,"title":"FAKE"}] |
You can use pd.json_normalize directly, which should be simpler.
>>> df
ID PROPERTIES FORMSUBMISSIONS
0 123 {'firstname': {'value': 'FAKE'}, 'lastmodified... [{'contact-associated-by': ['FAKE'], 'conversi...
>>> df = df.explode('FORMSUBMISSIONS') # list to dict
>>> df
ID PROPERTIES FORMSUBMISSIONS
0 123 {'firstname': {'value': 'FAKE'}, 'lastmodified... {'contact-associated-by': ['FAKE'], 'conversio...
Now you can call json_normalize on the FORMSUBMISSIONS column. To preserve the other columns, use pd.concat:
>>> df = pd.concat([df, pd.json_normalize(df['FORMSUBMISSIONS'])], axis=1).drop('FORMSUBMISSIONS', axis=1)
>>> df
ID PROPERTIES contact-associated-by conversion-id form-id form-type meta-data portal-id timestamp title
0 123 {'firstname': {'value': 'FAKE'}, 'lastmodified... [FAKE] FAKE FAKE FAKE [] FAKE FAKE FAKE
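One caveat, assuming FORMSUBMISSIONS can hold more than one record (the sample only shows one): explode then leaves duplicate index values, while pd.json_normalize returns a fresh RangeIndex, so the concat above can fail to align the rows. Resetting the index after the explode (or copying df.index onto the normalized frame) avoids that:
df = df.explode('FORMSUBMISSIONS').reset_index(drop=True)  # unique index again
df = pd.concat([df, pd.json_normalize(df['FORMSUBMISSIONS'])], axis=1).drop('FORMSUBMISSIONS', axis=1)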
You can do the same thing with the PROPERTIES column:
df = pd.concat([df, pd.json_normalize(df.PROPERTIES)], axis=1).drop('PROPERTIES', axis=1)
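Putting it all together, here is a minimal sketch. It assumes the two columns may arrive as JSON strings rather than Python dicts/lists (which would also explain why the linked function left them untouched); if they are already Python objects, the json.loads step can be dropped. The flatten_blobs name and the column prefixes are my own choices, not from the original post.
import json
import pandas as pd

def flatten_blobs(df):
    # Parse JSON strings into Python objects where necessary
    # (assumption: the columns may be stored as strings).
    for col in ['PROPERTIES', 'FORMSUBMISSIONS']:
        df[col] = df[col].apply(lambda v: json.loads(v) if isinstance(v, str) else v)

    # One row per record in the FORMSUBMISSIONS list
    df = df.explode('FORMSUBMISSIONS').reset_index(drop=True)

    # Spread each dict column into its own set of prefixed columns
    for col in ['PROPERTIES', 'FORMSUBMISSIONS']:
        expanded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
        expanded.index = df.index
        df = pd.concat([df, expanded], axis=1).drop(columns=col)
    return df

flat = flatten_blobs(df)
Called on the sample row, this yields one row per form submission, with columns such as PROPERTIES.firstname.value and FORMSUBMISSIONS.form-id alongside the untouched ID column.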