Convert a pandas DataFrame to a dict and drop NaNs
I have a pandas DataFrame with NaNs in it, like this:
import pandas as pd
import numpy as np
raw_data={'A':{1:2,2:3,3:4},'B':{1:np.nan,2:44,3:np.nan}}
data=pd.DataFrame(raw_data)
>>> data
   A     B
1  2   NaN
2  3  44.0
3  4   NaN
Now I want to make a dict out of it and at the same time remove the NaNs. The result should look like this:
{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}
But using the pandas to_dict function gives me a result like this:
>>> data.to_dict()
{'A': {1: 2, 2: 3, 3: 4}, 'B': {1: nan, 2: 44.0, 3: nan}}
So how do I make a dict out of the DataFrame and get rid of the NaNs?
There are many ways you could accomplish this; I spent some time evaluating performance on a not-so-large (70k rows) DataFrame. Although @der_die_das_jojo's answer is functional, it's also pretty slow.
The answer suggested by this question actually turns out to be about 5x faster on a large DataFrame.
On my test DataFrame (df):
Above method:
%time [ v.dropna().to_dict() for k,v in df.iterrows() ]
CPU times: user 51.2 s, sys: 0 ns, total: 51.2 s
Wall time: 50.9 s
Another slow method:
%time df.apply(lambda x: x.dropna().to_dict(), axis=1).tolist()
CPU times: user 1min 8s, sys: 880 ms, total: 1min 8s
Wall time: 1min 8s
Fastest method I could find:
%time [ {k: v for k, v in m.items() if pd.notnull(v)} for m in df.to_dict(orient='records') ]
CPU times: user 14.5 s, sys: 176 ms, total: 14.7 s
Wall time: 14.7 s
The format of this output is row-oriented (a list of dicts, one per row); you may need to make adjustments if you want the column-oriented form in the question.
I'd be very interested if anyone finds an even faster answer to this question.
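If you do need the column-oriented form from the question, the row-oriented result can be regrouped by the original index. A minimal sketch (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': {1: 2, 2: 3, 3: 4},
                   'B': {1: np.nan, 2: 44, 3: np.nan}})

# Row-oriented result, NaNs filtered out (the fast method above).
rows = [{k: v for k, v in m.items() if pd.notnull(v)}
        for m in df.to_dict(orient='records')]

# Regroup into the column-oriented form, keyed by the original index
# (orient='records' itself discards the index, so we zip it back in).
cols = {}
for idx, row in zip(df.index, rows):
    for col, val in row.items():
        cols.setdefault(col, {})[idx] = val

print(cols)  # {'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}
```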
The first graph generates dictionaries per column, so the output is a few very long dictionaries, and the number of dicts depends on the number of columns.
I tested multiple methods with perfplot, and for larger DataFrames the fastest method is to loop over each column and remove the missing values or Nones with Series.dropna, or with Series.notna in boolean indexing.
For smaller DataFrames, the fastest is a dictionary comprehension that tests for missing values with the NaN != NaN trick and also tests for Nones.
import numpy as np
import pandas as pd
import perfplot

np.random.seed(2020)

def comp_notnull(df1):
    return {k1: {k: v for k, v in v1.items() if pd.notnull(v)}
            for k1, v1 in df1.to_dict().items()}

def comp_NaNnotNaN_None(df1):
    # NaN is the only value for which v == v is False;
    # None is tested separately.
    return {k1: {k: v for k, v in v1.items() if v == v and v is not None}
            for k1, v1 in df1.to_dict().items()}

def comp_dropna(df1):
    return {k: v.dropna().to_dict() for k, v in df1.items()}

def comp_bool_indexing(df1):
    return {k: v[v.notna()].to_dict() for k, v in df1.items()}

def make_df(n):
    df1 = pd.DataFrame(np.random.choice([1, 2, np.nan], size=(n, 5)),
                       columns=list('ABCDE'))
    return df1

perfplot.show(
    setup=make_df,
    kernels=[comp_dropna, comp_bool_indexing, comp_notnull, comp_NaNnotNaN_None],
    n_range=[10**k for k in range(1, 7)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
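As a quick sanity check (not a benchmark), the NaN != NaN kernel applied to the question's small frame produces the expected result:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': {1: 2, 2: 3, 3: 4},
                    'B': {1: np.nan, 2: 44, 3: np.nan}})

# v == v is False only for NaN, so this filters NaNs without
# calling pd.notnull; None is filtered by the identity test.
out = {k1: {k: v for k, v in v1.items() if v == v and v is not None}
       for k1, v1 in df1.to_dict().items()}
print(out)  # {'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}
```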
The other situation is generating dictionaries per row: you get a list with a huge number of small dictionaries. Then the fastest method is a list comprehension that filters out NaNs and Nones:
import numpy as np
import pandas as pd
import perfplot

np.random.seed(2020)

def comp_notnull1(df1):
    return [{k: v for k, v in m.items() if pd.notnull(v)}
            for m in df1.to_dict(orient='records')]

def comp_NaNnotNaN_None1(df1):
    return [{k: v for k, v in m.items() if v == v and v is not None}
            for m in df1.to_dict(orient='records')]

def comp_dropna1(df1):
    return [v.dropna().to_dict() for k, v in df1.T.items()]

def comp_bool_indexing1(df1):
    return [v[v.notna()].to_dict() for k, v in df1.T.items()]

def make_df(n):
    df1 = pd.DataFrame(np.random.choice([1, 2, np.nan], size=(n, 5)),
                       columns=list('ABCDE'))
    return df1

perfplot.show(
    setup=make_df,
    kernels=[comp_dropna1, comp_bool_indexing1, comp_notnull1, comp_NaNnotNaN_None1],
    n_range=[10**k for k in range(1, 7)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
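Applied to the question's small frame, the row-wise list comprehension yields one small dict per row; note that the index labels are lost with orient='records':

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': {1: 2, 2: 3, 3: 4},
                    'B': {1: np.nan, 2: 44, 3: np.nan}})

# One dict per row, NaN entries dropped; the row index is discarded.
out = [{k: v for k, v in m.items() if pd.notnull(v)}
       for m in df1.to_dict(orient='records')]
print(out)  # [{'A': 2}, {'A': 3, 'B': 44.0}, {'A': 4}]
```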
Write a function inspired by to_dict from pandas:
import pandas as pd
import numpy as np

def to_dict_dropna(data):
    # pandas.compat.iteritems was removed in modern pandas;
    # iterating the DataFrame's columns directly is equivalent.
    return {k: v.dropna().to_dict() for k, v in data.items()}

raw_data = {'A': {1: 2, 2: 3, 3: 4}, 'B': {1: np.nan, 2: 44, 3: np.nan}}
data = pd.DataFrame(raw_data)
result = to_dict_dropna(data)  # avoid shadowing the built-in dict
and as a result you get what you want:
>>> result
{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}
You can define your own mapping class and get rid of the NaNs there:
class NotNanDict(dict):
    @staticmethod
    def is_nan(v):
        # The nested dicts are the per-column mappings; keep them.
        if isinstance(v, dict):
            return False
        return np.isnan(v)

    def __new__(cls, a):
        return {k: v for k, v in a if not cls.is_nan(v)}

data.to_dict(into=NotNanDict)
Output:
{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}
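One caveat: np.isnan raises a TypeError on non-numeric values such as strings. A more robust variant (my own adjustment, not part of the original answer) uses pd.isna, which handles any dtype:

```python
import numpy as np
import pandas as pd

class NotNanDict(dict):
    @staticmethod
    def is_nan(v):
        # Nested dicts are the per-column mappings; keep them.
        if isinstance(v, dict):
            return False
        # pd.isna handles NaN, None and NaT for any dtype, where
        # np.isnan would raise TypeError on e.g. strings.
        return pd.isna(v)

    def __new__(cls, a):
        return {k: v for k, v in a if not cls.is_nan(v)}

data = pd.DataFrame({'A': {1: 2, 2: 3, 3: 4},
                     'B': {1: np.nan, 2: 44, 3: np.nan}})
out = data.to_dict(into=NotNanDict)
print(out)  # {'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}
```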
Timing (from @jezrael answer):
To boost the speed you can use numba:
from numba import jit

@jit
def dropna(arr):
    # Assumes a 1-based index, as in the question's example.
    return [(i + 1, n) for i, n in enumerate(arr) if not np.isnan(n)]

class NotNanDict(dict):
    def __new__(cls, a):
        return {k: dict(dropna(v.to_numpy())) for k, v in a}

data.to_dict(orient='series', into=NotNanDict)
Output:
{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}
Timing (from @jezrael answer):
You can use a dict comprehension and loop over the columns:
{col:df[col].dropna().to_dict() for col in df}
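For the frame in the question this one-liner gives exactly the requested output (assuming df holds the question's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': {1: 2, 2: 3, 3: 4},
                   'B': {1: np.nan, 2: 44, 3: np.nan}})

# Iterating over a DataFrame yields the column labels; dropna trims
# each column's Series before converting it to a dict.
out = {col: df[col].dropna().to_dict() for col in df}
print(out)  # {'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}
```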