Unnest (explode) a Pandas Series
Solution 1:
Using list
+ str.join
and np.repeat
-
pd.DataFrame(
{
'col1' : list(''.join(df.col1)),
'col2' : df.col2.values.repeat(df.col1.str.len(), axis=0)
})
col1 col2
0 a 1
1 s 1
2 d 1
3 f 1
4 x 2
5 y 2
6 q 3
A generalised solution for any number of columns is easily achievable, without much change to the solution -
i = list(''.join(df.col1))
j = df.drop('col1', 1).values.repeat(df.col1.str.len(), axis=0)
df = pd.DataFrame(j, columns=df.columns.difference(['col1']))
df.insert(0, 'col1', i)
df
col1 col2
0 a 1
1 s 1
2 d 1
3 f 1
4 x 2
5 y 2
6 q 3
Performance
df = pd.concat([df] * 100000, ignore_index=True)
# MaxU's solution
%%timeit
df.col1.str.extractall(r'(.)') \
.reset_index(level=1, drop=True) \
.join(df['col2']) \
.reset_index(drop=True)
1 loop, best of 3: 1.98 s per loop
# piRSquared's solution
%%timeit
pd.DataFrame(
[[x] + b for a, *b in df.values for x in a],
columns=df.columns
)
1 loop, best of 3: 1.68 s per loop
# Wen's solution
%%timeit
v = df.col1.apply(list)
pd.DataFrame({'col1':np.concatenate(v.values),'col2':df.col2.repeat(v.apply(len))})
1 loop, best of 3: 835 ms per loop
# Alexander's solution
%%timeit
pd.DataFrame([(letter, i)
for letters, i in zip(df['col1'], df['col2'])
for letter in letters],
columns=df.columns)
1 loop, best of 3: 316 ms per loop
%%timeit
pd.DataFrame(
{
'col1' : list(''.join(df.col1)),
'col2' : df.col2.values.repeat(df.col1.str.len(), axis=0)
})
10 loops, best of 3: 124 ms per loop
I tried timing Vaishali's, but it took too long on this dataset.
Solution 2:
pd.DataFrame([(letter, i)
for letters, i in zip(df['col1'], df['col2'])
for letter in letters],
columns=df.columns)
Solution 3:
Trick from the list
:-)
df.col1=df.col1.apply(list)
df
Out[489]:
col1 col2
0 [a, s, d, f] 1
1 [x, y] 2
2 [q] 3
pd.DataFrame({'col1':np.concatenate(df.col1.values),'col2':df.col2.repeat(df.col1.apply(len))})
Out[490]:
col1 col2
0 a 1
0 s 1
0 d 1
0 f 1
1 x 2
1 y 2
2 q 3
Solution 4:
In [86]: df.col1.str.extractall(r'(.)') \
.reset_index(level=1, drop=True) \
.join(df['col2']) \
.reset_index(drop=True)
Out[86]:
0 col2
0 a 1
1 s 1
2 d 1
3 f 1
4 x 2
5 y 2
6 q 3