Create dummies from column with multiple values in pandas
Solution 1:
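It has been a while since this question was asked, but pandas now has a documented one-liner for this: join each row's labels into a single delimited string with str.join, then split that string back into 0/1 indicator columns with str.get_dummies. The separator ('*' here) must not occur in any label.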
In [4]: df
Out[4]:
       label
0  (a, c, e)
1     (a, d)
2       (b,)
3     (d, e)
In [5]: df['label'].str.join(sep='*').str.get_dummies(sep='*')
Out[5]:
   a  b  c  d  e
0  1  0  1  0  1
1  1  0  0  1  0
2  0  1  0  0  0
3  0  0  0  1  1
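The session above shows only the printed frame, so here is a minimal self-contained sketch of it. How df is constructed is an assumption (a single 'label' column holding tuples of strings); the one-liner itself is unchanged.

```python
import pandas as pd

# Assumed construction: the original shows only the printed frame.
df = pd.DataFrame({"label": [("a", "c", "e"), ("a", "d"), ("b",), ("d", "e")]})

# Step 1: collapse each tuple into one delimited string, e.g. "a*c*e".
joined = df["label"].str.join(sep="*")

# Step 2: split on the same separator and build one 0/1 column per label.
dummies = joined.str.get_dummies(sep="*")

print(dummies)
```

The columns come out sorted by label, and the original row index is kept, so the result can be joined back onto df with df.join(dummies) if needed.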
Solution 2:
I have a somewhat cleaner solution. Assume we want to transform the following long-format dataframe, which has one row per (pageid, category) pair,
   pageid category
0       0        a
1       0        b
2       1        a
3       1        c
into one row per pageid with a 0/1 indicator column per category:
        a  b  c
pageid
0       1  1  0
1       1  0  1
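One way to do it is to use scikit-learn's DictVectorizer: build one dict per pageid that maps each of its categories to 1, and let the vectorizer turn those dicts into an indicator matrix. I would, however, be interested in learning about other methods.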
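import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame(dict(pageid=[0, 0, 1, 1], category=['a', 'b', 'a', 'c']))
# One tuple of (category, 1) pairs per pageid
grouped = df.groupby('pageid').category.apply(lambda lst: tuple((k, 1) for k in lst))
category_dicts = [dict(tuples) for tuples in grouped]
v = DictVectorizer(sparse=False)  # sparse=False returns a dense float array
X = v.fit_transform(category_dicts)
# Needs scikit-learn >= 1.0; older releases spell this get_feature_names()
pd.DataFrame(X, columns=v.get_feature_names_out(), index=grouped.index)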
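As for other methods, one pandas-only option is pd.crosstab. This is a sketch of an alternative, not part of the DictVectorizer approach above. It counts (pageid, category) pairs, and the clip call is an added safeguard that caps counts at 1 in case a pair appears more than once.

```python
import pandas as pd

df = pd.DataFrame({"pageid": [0, 0, 1, 1], "category": ["a", "b", "a", "c"]})

# crosstab counts how often each category occurs for each pageid;
# clip(upper=1) turns those counts into 0/1 indicators.
indicators = pd.crosstab(df["pageid"], df["category"]).clip(upper=1)

print(indicators)
```

Unlike the DictVectorizer version, this keeps integer values and needs no scikit-learn dependency.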