Split / Explode a column of dictionaries into separate columns with pandas

I have data saved in a postgreSQL database. I am querying this data using Python2.7 and turning it into a Pandas DataFrame. However, the last column of this dataframe has a dictionary of values inside it. The DataFrame df looks like this:

Station ID     Pollutants
8809           {"a": "46", "b": "3", "c": "12"}
8810           {"a": "36", "b": "5", "c": "8"}
8811           {"b": "2", "c": "7"}
8812           {"c": "11"}
8813           {"a": "82", "c": "15"}

I need to split this column into separate columns, so that the DataFrame `df2 looks like this:

Station ID     a      b       c
8809           46     3       12
8810           36     5       8
8811           NaN    2       7
8812           NaN    NaN     11
8813           82     NaN     15

The major issue I'm having is that the lists are not the same lengths. But all of the lists only contain up to the same 3 values: 'a', 'b', and 'c'. And they always appear in the same order ('a' first, 'b' second, 'c' third).

The following code USED to work and return exactly what I wanted (df2).

objs = [df, pandas.DataFrame(df['Pollutant Levels'].tolist()).iloc[:, :3]]
df2 = pandas.concat(objs, axis=1).drop('Pollutant Levels', axis=1)
print(df2)

I was running this code just last week and it was working fine. But now my code is broken and I get this error from line [4]:

IndexError: out-of-bounds on slice (end) 

I made no changes to the code but am now getting the error. I feel this is due to my method not being robust or proper.

Any suggestions or guidance on how to split this column of lists into separate columns would be super appreciated!

EDIT: I think the .tolist() and .apply methods are not working on my code because it is one Unicode string, i.e.:

#My data format 
u{'a': '1', 'b': '2', 'c': '3'}

#and not
{u'a': '1', u'b': '2', u'c': '3'}

The data is imported from the postgreSQL database in this format. Any help or ideas with this issue? is there a way to convert the Unicode?


Solution 1:

To convert the string to an actual dict, you can do df['Pollutant Levels'].map(eval). Afterwards, the solution below can be used to convert the dict to different columns.


Using a small example, you can use .apply(pd.Series):

In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})

In [3]: df
Out[3]:
   a                   b
0  1           {u'c': 1}
1  2           {u'd': 3}
2  3  {u'c': 5, u'd': 6}

In [4]: df['b'].apply(pd.Series)
Out[4]:
     c    d
0  1.0  NaN
1  NaN  3.0
2  5.0  6.0

To combine it with the rest of the dataframe, you can concat the other columns with the above result:

In [7]: pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
Out[7]:
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0

Using your code, this also works if I leave out the iloc part:

In [15]: pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
Out[15]:
   a    c    d
0  1  1.0  NaN
1  2  NaN  3.0
2  3  5.0  6.0

Solution 2:

I know the question is quite old, but I got here searching for answers. There is actually a better (and faster) way now of doing this using json_normalize:

import pandas as pd

df2 = pd.json_normalize(df['Pollutant Levels'])

This avoids costly apply functions...

Solution 3:

  • The fastest method to normalize a column of flat, one-level dicts, as per the timing analysis performed by Shijith in this answer:
    • df.join(pd.DataFrame(df.pop('Pollutants').values.tolist()))
    • It will not resolve other issues with columns of list or dicts that are addressed below, such as rows with NaN, or nested dicts.
  1. pd.json_normalize(df.Pollutants) is significantly faster than df.Pollutants.apply(pd.Series)
    • See the %%timeit below. For 1M rows, .json_normalize is 47 times faster than .apply.
  2. Whether reading data from a file, or from an object returned by a database, or API, it may not be clear if the dict column has dict or str type.
    • If the dictionaries in the column are str type, they must be converted back to a dict type, using ast.literal_eval, or json.loads(…).
  3. Use pd.json_normalize to convert the dicts, with keys as headers and values for rows.
    • There are additional parameters (e.g. record_path & meta) for dealing with nested dicts.
  4. Use pandas.DataFrame.join to combine the original DataFrame, df, with the columns created using pd.json_normalize
    • If the index isn't integers (as in the example), first use df.reset_index() to get an index of integers, before doing the normalize and join.
  5. Finally, use pandas.DataFrame.drop, to remove the unneeded column of dicts
  • As a note, if the column has any NaN, they must be filled with an empty dict
    • df.Pollutants = df.Pollutants.fillna({i: {} for i in df.index})
      • If the 'Pollutants' column is strings, use '{}'.
      • Also see How to json_normalize a column with NaNs?.
import pandas as pd
from ast import literal_eval
import numpy as np

data = {'Station ID': [8809, 8810, 8811, 8812, 8813, 8814],
        'Pollutants': ['{"a": "46", "b": "3", "c": "12"}', '{"a": "36", "b": "5", "c": "8"}', '{"b": "2", "c": "7"}', '{"c": "11"}', '{"a": "82", "c": "15"}', np.nan]}

df = pd.DataFrame(data)

# display(df)
   Station ID                        Pollutants
0        8809  {"a": "46", "b": "3", "c": "12"}
1        8810   {"a": "36", "b": "5", "c": "8"}
2        8811              {"b": "2", "c": "7"}
3        8812                       {"c": "11"}
4        8813            {"a": "82", "c": "15"}
5        8814                               NaN

# replace NaN with '{}' if the column is strings, otherwise replace with {}
# df.Pollutants = df.Pollutants.fillna('{}')  # if the NaN is in a column of strings
df.Pollutants = df.Pollutants.fillna({i: {} for i in df.index})  # if the column is not strings

# Convert the column of stringified dicts to dicts
# skip this line, if the column contains dicts
df.Pollutants = df.Pollutants.apply(literal_eval)

# reset the index if the index is not unique integers from 0 to n-1
# df.reset_index(inplace=True)  # uncomment if needed

# normalize the column of dictionaries and join it to df
df = df.join(pd.json_normalize(df.Pollutants))

# drop Pollutants
df.drop(columns=['Pollutants'], inplace=True)

# display(df)
   Station ID    a    b    c
0        8809   46    3   12
1        8810   36    5    8
2        8811  NaN    2    7
3        8812  NaN  NaN   11
4        8813   82  NaN   15
5        8814  NaN  NaN  NaN

%%timeit

# dataframe with 1M rows
dfb = pd.concat([df]*200000).reset_index(drop=True)

%%timeit
dfb.join(pd.json_normalize(dfb.Pollutants))
[out]:
5.44 s ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
pd.concat([dfb.drop(columns=['Pollutants']), dfb.Pollutants.apply(pd.Series)], axis=1)
[out]:
4min 17s ± 2.44 s per loop (mean ± std. dev. of 7 runs, 1 loop each)