Pandas 'describe' is not returning summary of all columns
I am running 'describe()' on a dataframe and getting summaries of only the int columns (pandas 0.14.0).
The documentation says that for object columns, the frequency of the most common value and additional statistics would be returned. What could be wrong? (No error message is returned, by the way.)
Edit:
I think it's how the function is set to behave on mixed column types in a dataframe, although the documentation fails to mention it.
Example code:
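import pandas as pd

df_test = pd.DataFrame({'$a':[1,2], '$b': [10,20]})
df_test.dtypes
df_test.describe()  # both columns are int, so both are summarized
df_test['$a'] = df_test['$a'].astype(str)
df_test.describe()  # now only '$b' is summarized
df_test['$a'].describe()  # object summary: count, unique, top, freq
df_test['$b'].describe()  # numeric summary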
My ugly work around in the meanwhile:
def my_df_describe(df):
    objects = []
    numerics = []
    for c in df:
        if (df[c].dtype == object):
            objects.append(c)
        else:
            numerics.append(c)
    return df[numerics].describe(), df[objects].describe()
As of pandas v0.15.0, use the parameter DataFrame.describe(include = 'all')
to get a summary of all the columns when the dataframe has mixed column types. The default behavior is to provide a summary for the numerical columns only.
Example:
In[1]:
df = pd.DataFrame({'$a':['a', 'b', 'c', 'd', 'a'], '$b': np.arange(5)})
df.describe(include = 'all')
Out[1]:
         $a        $b
count     5  5.000000
unique    4       NaN
top       a       NaN
freq      2       NaN
mean    NaN  2.000000
std     NaN  1.581139
min     NaN  0.000000
25%     NaN  1.000000
50%     NaN  2.000000
75%     NaN  3.000000
max     NaN  4.000000
The numerical columns will have NaNs for summary statistics pertaining to objects (strings) and vice versa.
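The NaN pattern described above can be checked directly. This is a minimal sketch that rebuilds the same example frame; the variable name summary is only for illustration.

```python
import numpy as np
import pandas as pd

# Same example frame as above: one string column, one integer column
df = pd.DataFrame({'$a': ['a', 'b', 'c', 'd', 'a'], '$b': np.arange(5)})

summary = df.describe(include='all')

# Object-only statistics are NaN for the numeric column ...
print(pd.isna(summary.loc['unique', '$b']))
# ... and numeric-only statistics are NaN for the string column
print(pd.isna(summary.loc['mean', '$a']))
```

Both checks print True, since each column only fills in the rows that apply to its type.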
Summarizing only numerical or object columns
- To call describe() on just the numerical columns, use describe(include = [np.number])
- To call describe() on just the objects (strings), use describe(include = ['O'])

In[2]:
df.describe(include = [np.number])
Out[2]:
             $b
count  5.000000
mean   2.000000
std    1.581139
min    0.000000
25%    1.000000
50%    2.000000
75%    3.000000
max    4.000000

In[3]:
df.describe(include = ['O'])
Out[3]:
       $a
count   5
unique  4
top     a
freq    2
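describe() also accepts an exclude parameter, which can select the complement of a type instead of naming it. The following is a small sketch on the same example frame, not a replacement for the calls above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'$a': ['a', 'b', 'c', 'd', 'a'], '$b': np.arange(5)})

# Only the numeric columns
numeric_summary = df.describe(include=[np.number])

# Everything except the numeric columns (here: the string column)
non_numeric_summary = df.describe(exclude=[np.number])

print(list(numeric_summary.columns))
print(list(non_numeric_summary.columns))
```

Using exclude can be handy when the non-numeric columns have several different dtypes and listing them all in include would be tedious.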
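Setting pd.options.display.max_columns = DATA.shape[1] will work.
Here DATA is a 2D matrix (the data being described), so DATA.shape[1] is its number of columns. With this setting, the stats for all columns are displayed, stacked vertically.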
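If the columns are only missing because the display limit hides them, the same option can be applied temporarily. This is a minimal sketch, assuming a hypothetical 30-column DATA frame made up for illustration; pd.option_context restores the previous value when the block exits.

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame standing in for DATA (2 rows, 30 columns)
DATA = pd.DataFrame(np.arange(60).reshape(2, 30))

# Raise the column display limit only inside this block
with pd.option_context('display.max_columns', DATA.shape[1]):
    limit_inside = pd.get_option('display.max_columns')
    print(DATA.describe())

# Outside the block the previous limit is back in effect
limit_after = pd.get_option('display.max_columns')
```

This avoids changing the global display setting for the rest of the session.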