Python Pandas - Changing some column types to categories
Sometimes, you just have to use a for-loop:
for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
You can use the pandas.DataFrame.apply method along with a lambda expression to solve this. In your example you could use:
df[['parks', 'playgrounds', 'sports']].apply(lambda x: x.astype('category'))
I don't know of a way to do this in place, so typically I'll end up assigning the result back, like this:
df[df.select_dtypes(['object']).columns] = df.select_dtypes(['object']).apply(lambda x: x.astype('category'))
Obviously you can replace .select_dtypes with explicit column names if you don't want to select all columns of a certain datatype (although in your example it seems like you wanted all object types).
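A minimal sketch of the select_dtypes approach above on a small made-up DataFrame. The column values and the integer `resident` column are illustrative assumptions, and `dtype=object` is spelled out so the example does not depend on how the installed pandas version infers string columns:

```python
import pandas as pd

# Hypothetical sample data: two object (string) columns and one integer column.
df = pd.DataFrame({
    'parks': pd.Series(['agree', 'disagree', 'agree'], dtype=object),
    'sports': pd.Series(['important', 'not important', 'important'], dtype=object),
    'resident': [1, 2, 3],
})

# Names of all object-dtype columns.
obj_cols = df.select_dtypes(['object']).columns

# Convert each of those columns and assign the result back to the same columns.
df[obj_cols] = df[obj_cols].apply(lambda x: x.astype('category'))

print(df.dtypes)
```

Only the columns picked up by select_dtypes are reassigned, so the integer column keeps its dtype.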
No need for loops; pandas can do it directly now. Just pass a list of the columns you want to convert and pandas will convert them all.
cols = ['parks', 'playgrounds', 'sports', 'roading']
public[cols] = public[cols].astype('category')
df = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['c', 'd', 'e']})
>> a b
>> 0 a c
>> 1 b d
>> 2 c e
df.dtypes
>> a object
>> b object
>> dtype: object
df[df.columns] = df[df.columns].astype('category')
df.dtypes
>> a category
>> b category
>> dtype: object
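When only some columns should be converted, DataFrame.astype also accepts a dict mapping column names to dtypes. This is a variation on the list-based form above rather than something the answer itself shows, and the extra column `n` is a hypothetical addition for illustration:

```python
import pandas as pd

# Same data as above, plus a hypothetical integer column 'n'.
df = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['c', 'd', 'e'], 'n': [1, 2, 3]})

# astype returns a new DataFrame; only the columns named in the dict are converted.
df = df.astype({'a': 'category', 'b': 'category'})

print(df.dtypes)
```

Because astype returns a new object, the result has to be assigned back, just as with the list-based version.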
As of pandas 0.19.0, read_csv supports parsing Categorical columns directly, as described in the What's New notes.
This answer applies only if you're starting from read_csv; otherwise, I think unutbu's answer is still best.
Example on 10,000 records:
import pandas as pd
import numpy as np
# Generate random data: four category-like columns, two int columns
N = 10000
categories = pd.DataFrame({
    'parks': np.random.choice(['strongly agree', 'agree', 'disagree'], size=N),
    'playgrounds': np.random.choice(['strongly agree', 'agree', 'disagree'], size=N),
    'sports': np.random.choice(['important', 'very important', 'not important'], size=N),
    'roading': np.random.choice(['important', 'very important', 'not important'], size=N),
    'resident': np.random.choice([1, 2, 3], size=N),
    'children': np.random.choice([0, 1, 2, 3], size=N)
})
categories.to_csv('categories_large.csv', index=False)
<0.19.0 (or >=0.19.0 without specifying dtype)
pd.read_csv('categories_large.csv').dtypes # inspect default dtypes
children int64
parks object
playgrounds object
resident int64
roading object
sports object
dtype: object
>=0.19.0
For mixed dtypes, parsing as Categorical can be implemented by passing a dictionary dtype={'colname': 'category', ...} to read_csv.
pd.read_csv('categories_large.csv', dtype={'parks': 'category',
'playgrounds': 'category',
'sports': 'category',
'roading': 'category'}).dtypes
children int64
parks category
playgrounds category
resident int64
roading category
sports category
dtype: object
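A self-contained sketch of the same idea that does not need a file on disk. The in-memory CSV text and its three columns are hypothetical stand-ins for categories_large.csv:

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for categories_large.csv.
csv_text = (
    "parks,sports,resident\n"
    "agree,important,1\n"
    "disagree,not important,2\n"
    "agree,very important,3\n"
)

# Build the dtype dictionary from a list of column names.
category_cols = ['parks', 'sports']
public = pd.read_csv(
    io.StringIO(csv_text),
    dtype={col: 'category' for col in category_cols},
)

print(public.dtypes)
```

Columns not named in the dictionary (here `resident`) are still inferred as usual.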
Performance
Parsing the categories during read_csv gives a slight speed-up (measured in a local Jupyter notebook), as mentioned in the release notes.
# unutbu's answer
%%timeit
public = pd.read_csv('categories_large.csv')
for col in ['parks', 'playgrounds', 'sports', 'roading']:
    public[col] = public[col].astype('category')
10 loops, best of 3: 20.1 ms per loop
# parsed during read_csv
%%timeit
category_cols = {item: 'category' for item in ['parks', 'playgrounds', 'sports', 'roading']}
public = pd.read_csv('categories_large.csv', dtype=category_cols)
100 loops, best of 3: 14.3 ms per loop
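Outside a notebook, where %%timeit is not available, a comparison along these lines can be sketched with the standard timeit module. This is only an approximation of the measurement above: it reads from an in-memory string rather than a file, the generated data is a hypothetical stand-in, and the absolute numbers will vary by machine and pandas version:

```python
import io
import timeit

import numpy as np
import pandas as pd

cols = ['parks', 'playgrounds', 'sports', 'roading']

# Hypothetical stand-in data: 10,000 rows of three possible answers per column.
rng = np.random.default_rng(0)
frame = pd.DataFrame(
    {c: rng.choice(['agree', 'disagree', 'strongly agree'], size=10_000) for c in cols}
)
csv_text = frame.to_csv(index=False)


def convert_after_read():
    # Read with default dtypes, then convert the columns afterwards.
    public = pd.read_csv(io.StringIO(csv_text))
    public[cols] = public[cols].astype('category')
    return public


def convert_during_read():
    # Let read_csv parse the columns as categoricals directly.
    return pd.read_csv(io.StringIO(csv_text), dtype={c: 'category' for c in cols})


runs = 20
t_after = timeit.timeit(convert_after_read, number=runs)
t_during = timeit.timeit(convert_during_read, number=runs)
print(f"astype after read_csv: {t_after / runs * 1e3:.1f} ms per call")
print(f"dtype in read_csv:     {t_during / runs * 1e3:.1f} ms per call")
```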