Split (explode) pandas dataframe string entry to separate rows
I have a pandas dataframe
in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (assume that CSV are clean and need only be split on ','). For example, a
should become b
:
In [7]: a
Out[7]:
var1 var2
0 a,b,c 1
1 d,e,f 2
In [8]: b
Out[8]:
var1 var2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
So far, I have tried various simple functions, but the .apply
method seems to only accept one row as return value when it is used on an axis, and I can't get .transform
to work. Any suggestions would be much appreciated!
Example data:
from pandas import DataFrame
import numpy as np
a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
{'var1': 'd,e,f', 'var2': 2}])
b = DataFrame([{'var1': 'a', 'var2': 1},
{'var1': 'b', 'var2': 1},
{'var1': 'c', 'var2': 1},
{'var1': 'd', 'var2': 2},
{'var1': 'e', 'var2': 2},
{'var1': 'f', 'var2': 2}])
I know this won't work because we lose DataFrame meta-data by going through numpy, but it should give you a sense of what I tried to do:
def fun(row):
letters = row['var1']
letters = letters.split(',')
out = np.array([row] * len(letters))
out['var1'] = letters
a['idx'] = range(a.shape[0])
z = a.groupby('idx')
z.transform(fun)
UPDATE 3: it makes more sense to use Series.explode()
/ DataFrame.explode()
methods (implemented in Pandas 0.25.0 and extended in Pandas 1.3.0 to support multi-column explode) as is shown in the usage example:
for a single column:
In [1]: df = pd.DataFrame({'A': [[0, 1, 2], 'foo', [], [3, 4]],
...: 'B': 1,
...: 'C': [['a', 'b', 'c'], np.nan, [], ['d', 'e']]})
In [2]: df
Out[2]:
A B C
0 [0, 1, 2] 1 [a, b, c]
1 foo 1 NaN
2 [] 1 []
3 [3, 4] 1 [d, e]
In [3]: df.explode('A')
Out[3]:
A B C
0 0 1 [a, b, c]
0 1 1 [a, b, c]
0 2 1 [a, b, c]
1 foo 1 NaN
2 NaN 1 []
3 3 1 [d, e]
3 4 1 [d, e]
for multiple columns (for Pandas 1.3.0+):
In [4]: df.explode(['A', 'C'])
Out[4]:
A B C
0 0 1 a
0 1 1 b
0 2 1 c
1 foo 1 NaN
2 NaN 1 NaN
3 3 1 d
3 4 1 e
UPDATE 2: more generic vectorized function, which will work for multiple normal
and multiple list
columns
def explode(df, lst_cols, fill_value='', preserve_index=False):
# make sure `lst_cols` is list-alike
if (lst_cols is not None
and len(lst_cols) > 0
and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
lst_cols = [lst_cols]
# all columns except `lst_cols`
idx_cols = df.columns.difference(lst_cols)
# calculate lengths of lists
lens = df[lst_cols[0]].str.len()
# preserve original index values
idx = np.repeat(df.index.values, lens)
# create "exploded" DF
res = (pd.DataFrame({
col:np.repeat(df[col].values, lens)
for col in idx_cols},
index=idx)
.assign(**{col:np.concatenate(df.loc[lens>0, col].values)
for col in lst_cols}))
# append those rows that have empty lists
if (lens == 0).any():
# at least one list in cells is empty
res = (res.append(df.loc[lens==0, idx_cols], sort=False)
.fillna(fill_value))
# revert the original index order
res = res.sort_index()
# reset index if requested
if not preserve_index:
res = res.reset_index(drop=True)
return res
Demo:
Multiple list
columns - all list
columns must have the same # of elements in each row:
In [134]: df
Out[134]:
aaa myid num text
0 10 1 [1, 2, 3] [aa, bb, cc]
1 11 2 [] []
2 12 3 [1, 2] [cc, dd]
3 13 4 [] []
In [135]: explode(df, ['num','text'], fill_value='')
Out[135]:
aaa myid num text
0 10 1 1 aa
1 10 1 2 bb
2 10 1 3 cc
3 11 2
4 12 3 1 cc
5 12 3 2 dd
6 13 4
preserving original index values:
In [136]: explode(df, ['num','text'], fill_value='', preserve_index=True)
Out[136]:
aaa myid num text
0 10 1 1 aa
0 10 1 2 bb
0 10 1 3 cc
1 11 2
2 12 3 1 cc
2 12 3 2 dd
3 13 4
Setup:
df = pd.DataFrame({
'aaa': {0: 10, 1: 11, 2: 12, 3: 13},
'myid': {0: 1, 1: 2, 2: 3, 3: 4},
'num': {0: [1, 2, 3], 1: [], 2: [1, 2], 3: []},
'text': {0: ['aa', 'bb', 'cc'], 1: [], 2: ['cc', 'dd'], 3: []}
})
CSV column:
In [46]: df
Out[46]:
var1 var2 var3
0 a,b,c 1 XX
1 d,e,f,x,y 2 ZZ
In [47]: explode(df.assign(var1=df.var1.str.split(',')), 'var1')
Out[47]:
var1 var2 var3
0 a 1 XX
1 b 1 XX
2 c 1 XX
3 d 2 ZZ
4 e 2 ZZ
5 f 2 ZZ
6 x 2 ZZ
7 y 2 ZZ
using this little trick we can convert CSV-like column to list
column:
In [48]: df.assign(var1=df.var1.str.split(','))
Out[48]:
var1 var2 var3
0 [a, b, c] 1 XX
1 [d, e, f, x, y] 2 ZZ
UPDATE: generic vectorized approach (will work also for multiple columns):
Original DF:
In [177]: df
Out[177]:
var1 var2 var3
0 a,b,c 1 XX
1 d,e,f,x,y 2 ZZ
Solution:
first let's convert CSV strings to lists:
In [178]: lst_col = 'var1'
In [179]: x = df.assign(**{lst_col:df[lst_col].str.split(',')})
In [180]: x
Out[180]:
var1 var2 var3
0 [a, b, c] 1 XX
1 [d, e, f, x, y] 2 ZZ
Now we can do this:
In [181]: pd.DataFrame({
...: col:np.repeat(x[col].values, x[lst_col].str.len())
...: for col in x.columns.difference([lst_col])
...: }).assign(**{lst_col:np.concatenate(x[lst_col].values)})[x.columns.tolist()]
...:
Out[181]:
var1 var2 var3
0 a 1 XX
1 b 1 XX
2 c 1 XX
3 d 2 ZZ
4 e 2 ZZ
5 f 2 ZZ
6 x 2 ZZ
7 y 2 ZZ
OLD answer:
Inspired by @AFinkelstein solution, i wanted to make it bit more generalized which could be applied to DF with more than two columns and as fast, well almost, as fast as AFinkelstein's solution):
In [2]: df = pd.DataFrame(
...: [{'var1': 'a,b,c', 'var2': 1, 'var3': 'XX'},
...: {'var1': 'd,e,f,x,y', 'var2': 2, 'var3': 'ZZ'}]
...: )
In [3]: df
Out[3]:
var1 var2 var3
0 a,b,c 1 XX
1 d,e,f,x,y 2 ZZ
In [4]: (df.set_index(df.columns.drop('var1',1).tolist())
...: .var1.str.split(',', expand=True)
...: .stack()
...: .reset_index()
...: .rename(columns={0:'var1'})
...: .loc[:, df.columns]
...: )
Out[4]:
var1 var2 var3
0 a 1 XX
1 b 1 XX
2 c 1 XX
3 d 2 ZZ
4 e 2 ZZ
5 f 2 ZZ
6 x 2 ZZ
7 y 2 ZZ
After painful experimentation to find something faster than the accepted answer, I got this to work. It ran around 100x faster on the dataset I tried it on.
If someone knows a way to make this more elegant, by all means please modify my code. I couldn't find a way that works without setting the other columns you want to keep as the index and then resetting the index and re-naming the columns, but I'd imagine there's something else that works.
b = DataFrame(a.var1.str.split(',').tolist(), index=a.var2).stack()
b = b.reset_index()[[0, 'var2']] # var1 variable is currently labeled 0
b.columns = ['var1', 'var2'] # renaming var1
Pandas >= 0.25
Series and DataFrame methods define a .explode()
method that explodes lists into separate rows. See the docs section on Exploding a list-like column.
Since you have a list of comma separated strings, split the string on comma to get a list of elements, then call explode
on that column.
df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]})
df
var1 var2
0 a,b,c 1
1 d,e,f 2
df.assign(var1=df['var1'].str.split(',')).explode('var1')
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
1 f 2
Note that explode
only works on a single column (for now). To explode multiple columns at once, see below.
NaNs and empty lists get the treatment they deserve without you having to jump through hoops to get it right.
df = pd.DataFrame({'var1': ['d,e,f', '', np.nan], 'var2': [1, 2, 3]})
df
var1 var2
0 d,e,f 1
1 2
2 NaN 3
df['var1'].str.split(',')
0 [d, e, f]
1 []
2 NaN
df.assign(var1=df['var1'].str.split(',')).explode('var1')
var1 var2
0 d 1
0 e 1
0 f 1
1 2 # empty list entry becomes empty string after exploding
2 NaN 3 # NaN left un-touched
This is a serious advantage over ravel
/repeat
-based solutions (which ignore empty lists completely, and choke on NaNs).
Exploding Multiple Columns
Note that explode
only works on a single column at a time, but you can use apply
to explode multiple column at once:
df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'],
'var2': ['i,j,k', 'l,m,n'],
'var3': [1, 2]})
df
var1 var2 var3
0 a,b,c i,j,k 1
1 d,e,f l,m,n 2
(df.set_index(['var3'])
.apply(lambda col: col.str.split(',').explode())
.reset_index()
.reindex(df.columns, axis=1))
df
var1 var2 var3
0 a i 1
1 b j 1
2 c k 1
3 d l 2
4 e m 2
5 f n 2
The idea is to set as the index, all the columns that should NOT be exploded, then explode the remaining columns via apply
. This works well when the lists are equally sized.
How about something like this:
In [55]: pd.concat([Series(row['var2'], row['var1'].split(','))
for _, row in a.iterrows()]).reset_index()
Out[55]:
index 0
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
Then you just have to rename the columns