make pandas DataFrame to a dict and dropna

I have some pandas DataFrame with NaNs in it. Like this:

import pandas as pd
import numpy as np
raw_data={'A':{1:2,2:3,3:4},'B':{1:np.nan,2:44,3:np.nan}}
data=pd.DataFrame(raw_data)
>>> data
   A   B
1  2 NaN
2  3  44
3  4 NaN

Now I want to make a dict out of it and at the same time remove the NaNs. The result should look like this:

{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}

But using pandas to_dict function gives me a result like this:

>>> data.to_dict()
{'A': {1: 2, 2: 3, 3: 4}, 'B': {1: nan, 2: 44.0, 3: nan}} 

So how to make a dict out of the DataFrame and get rid of the NaNs ?


There are many ways you could accomplish this, I spent some time evaluating performance on a not-so-large (70k) dataframe. Although @der_die_das_jojo's answer is functional, it's also pretty slow.

The answer suggested by this question actually turns out to be about 5x faster on a large dataframe.

On my test dataframe (df):

Above method:

%time [ v.dropna().to_dict() for k,v in df.iterrows() ]
CPU times: user 51.2 s, sys: 0 ns, total: 51.2 s
Wall time: 50.9 s

Another slow method:

%time df.apply(lambda x: [x.dropna()], axis=1).to_dict(orient='rows')
CPU times: user 1min 8s, sys: 880 ms, total: 1min 8s
Wall time: 1min 8s

Fastest method I could find:

%time [ {k:v for k,v in m.items() if pd.notnull(v)} for m in df.to_dict(orient='rows')]
CPU times: user 14.5 s, sys: 176 ms, total: 14.7 s
Wall time: 14.7 s

The format of this output is a row-oriented dictionary, you may need to make adjustments if you want the column-oriented form in the question.

Very interested if anyone finds an even faster answer to this question.


First graph generate dictionaries per columns, so output is few very long dictionaries, number of dicts depends of number of columns.

I test multiple methods with perfplot and fastest method is loop by each column and remove missing values or Nones by Series.dropna or with Series.notna in boolean indexing in larger DataFrames.

Is smaller DataFrames is fastest dictionary comprehension with testing missing values by NaN != NaN trick and also testing Nones.

graph

np.random.seed(2020)
import perfplot

def comp_notnull(df1):
    return {k1: {k:v for k,v in v1.items() if pd.notnull(v)} for k1, v1 in df1.to_dict().items()}

def comp_NaNnotNaN_None(df1):
    return {k1: {k:v for k,v in v1.items() if v == v and v is not None} for k1, v1 in df1.to_dict().items()}

def comp_dropna(df1):
    return {k: v.dropna().to_dict() for k,v in df1.items()}

def comp_bool_indexing(df1):
    return {k: v[v.notna()].to_dict() for k,v in df1.items()}

def make_df(n):
    df1 = pd.DataFrame(np.random.choice([1,2, np.nan], size=(n, 5)), columns=list('ABCDE'))
    return df1

perfplot.show(
    setup=make_df,
    kernels=[comp_dropna, comp_bool_indexing, comp_notnull, comp_NaNnotNaN_None],
    n_range=[10**k for k in range(1, 7)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')

Another situtation is if generate dictionaries per rows - get list of huge amount of small dictionaries, then fastest is list comprehension with filtering NaNs and Nones:

g1

np.random.seed(2020)
import perfplot


def comp_notnull1(df1):
    return [{k:v for k,v in m.items() if pd.notnull(v)} for m in df1.to_dict(orient='r')]

def comp_NaNnotNaN_None1(df1):
    return [{k:v for k,v in m.items() if v == v and v is not None} for m in df1.to_dict(orient='r')]

def comp_dropna1(df1):
    return [v.dropna().to_dict() for k,v in df1.T.items()]

def comp_bool_indexing1(df1):
    return [v[v.notna()].to_dict() for k,v in df1.T.items()]


def make_df(n):
    df1 = pd.DataFrame(np.random.choice([1,2, np.nan], size=(n, 5)), columns=list('ABCDE'))
    return df1

perfplot.show(
    setup=make_df,
    kernels=[comp_dropna1, comp_bool_indexing1, comp_notnull1, comp_NaNnotNaN_None1],
    n_range=[10**k for k in range(1, 7)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')

write a function insired by to_dict from pandas

import pandas as pd
import numpy as np
from pandas import compat 

def to_dict_dropna(self,data):
  return dict((k, v.dropna().to_dict()) for k, v in compat.iteritems(data))

raw_data={'A':{1:2,2:3,3:4},'B':{1:np.nan,2:44,3:np.nan}}
data=pd.DataFrame(raw_data)

dict=to_dict_dropna(data)

and as a result you get what you want:

>>> dict
{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}

you can have your own mapping class where you can get rid of the NANs:

class NotNanDict(dict):

    @staticmethod
    def is_nan(v):
        if isinstance(v, dict):
            return False
        return np.isnan(v)

    def __new__(self, a):
        return {k: v for k, v in a if not self.is_nan(v)} 

data.to_dict(into=NotNanDict)

Output:

{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}

Timing (from @jezrael answer):

enter image description here

to boost the speed you can use numba:

from numba import jit

@jit
def dropna(arr):
    return [(i + 1, n) for i, n in enumerate(arr) if not np.isnan(n)]


class NotNanDict(dict):

    def __new__(self, a):
        return {k: dict(dropna(v.to_numpy())) for k, v in a}

data.to_dict(orient='s', into=NotNanDict)

output:

{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}

Timing (from @jezrael answer):

enter image description here


You can use a dict comprehension and loop over the columns

{col:df[col].dropna().to_dict() for col in df}