Add column of empty lists to DataFrame
Solution 1:
One more way is to use np.empty:
df['empty_list'] = np.empty((len(df), 0)).tolist()
You could also drop .index from your "Method 1" when computing the length of df:
df['empty_list'] = [[] for _ in range(len(df))]
It turns out np.empty is faster:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df = pd.DataFrame(np.random.rand(1000000, 5))
In [4]: timeit df['empty1'] = np.empty((len(df), 0)).tolist()
10 loops, best of 3: 127 ms per loop
In [5]: timeit df['empty2'] = [[] for _ in range(len(df))]
10 loops, best of 3: 193 ms per loop
In [6]: timeit df['empty3'] = df.apply(lambda x: [], axis=1)
1 loops, best of 3: 5.89 s per loop
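To see why the np.empty approach works, here is a minimal sketch. The three-row frame and the column name are made up for illustration. An array of shape (n, 0) has n rows with no elements, so tolist() should give n separate empty lists:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})  # hypothetical sample frame

# n rows, zero columns: nothing is allocated per row
empty = np.empty((len(df), 0))
print(empty.shape)  # (3, 0)

# tolist() builds a new Python list for every row
rows = empty.tolist()
print(rows)  # [[], [], []]

df["empty_list"] = rows
df.at[0, "empty_list"].append("x")  # mutate the first row only
print(df["empty_list"].tolist())  # [['x'], [], []]
```

Appending to the first row leaves the other rows untouched, which is the behaviour you want from a column of empty lists.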
Solution 2:
EDIT: the commenters caught a bug in my original answer (kept under OLD below). Multiplying a list repeats references to the same list object:
s = pd.Series([[]] * 3)
s.iloc[0].append(1) #adding an item only to the first element
>s # unintended consequences:
0 [1]
1 [1]
2 [1]
So, the correct solution is:
s = pd.Series([[] for i in range(3)])
s.iloc[0].append(1)
>s
0 [1]
1 []
2 []
OLD:
I timed all three methods in the accepted answer; the fastest one took 216 ms on my machine. However, this took only 28 ms:
df['empty4'] = [[]] * len(df)
Note: Similarly, df['e5'] = [set()] * len(df) also took 28 ms.
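The speed difference probably comes from the fact that list multiplication only copies a reference n times, while a comprehension allocates n new lists. A rough sketch for measuring this in plain Python follows. The row count n and the repeat count are arbitrary choices, and timings will vary by machine:

```python
import timeit

n = 100_000  # hypothetical row count

# Repeats one reference n times; no new list objects are created
shared_time = timeit.timeit(lambda: [[]] * n, number=10)

# Allocates a fresh empty list for every slot
distinct_time = timeit.timeit(lambda: [[] for _ in range(n)], number=10)

print(f"[[]] * n:            {shared_time:.4f} s")
print(f"[[] for _ in ...]:   {distinct_time:.4f} s")

# The fast version is fast because every slot is the same object
shared = [[]] * n
print(shared[0] is shared[-1])  # True
```

So the time saved is exactly the work that would have made the rows independent.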
Solution 3:
Canonical solutions: list comprehension, map, and apply
Obligatory disclaimer: avoid using lists in pandas columns where possible. List columns are slow to work with because they hold Python objects, which are inherently hard to vectorize.
With that out of the way, here are the canonical methods of introducing a column of empty lists:
# List comprehension
df['c'] = [[] for _ in range(df.shape[0])]
df
a b c
0 1 5 []
1 2 6 []
2 3 7 []
There are also these shorthands involving apply and map:
from collections import defaultdict
# map any column with defaultdict
df['c'] = df.iloc[:,0].map(defaultdict(list))
# same as,
df['c'] = df.iloc[:,0].map(lambda _: [])
# apply with defaultdict
df['c'] = df.apply(defaultdict(list), axis=1)
# same as,
df['c'] = df.apply(lambda _: [], axis=1)
df
a b c
0 1 5 []
1 2 6 []
2 3 7 []
Things you should NOT do
Some folks believe multiplying an empty list is the way to go. Unfortunately, this is wrong and will usually lead to hard-to-debug issues. Here's a minimal example:
# WRONG
df['c'] = [[]] * len(df)
df.at[0, 'c'].append('abc')
df.at[1, 'c'].append('def')
df
a b c
0 1 5 [abc, def]
1 2 6 [abc, def]
2 3 7 [abc, def]
# RIGHT
df['c'] = [[] for _ in range(df.shape[0])]
df.at[0, 'c'].append('abc')
df.at[1, 'c'].append('def')
df
a b c
0 1 5 [abc]
1 2 6 [def]
2 3 7 []
In the first case, a single empty list is created and its reference is replicated across all the rows, so an update to one row shows up in all of them. In the latter case, each row is assigned its own empty list, so this is not a concern.
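If you suspect a column was built the wrong way, one way to check is to count the distinct objects it holds. The helper below, count_distinct_lists, is a made-up name for illustration and not part of pandas. It works on any sequence, including a list or the values of a column:

```python
def count_distinct_lists(values):
    """Return how many distinct objects appear in a sequence (by identity)."""
    return len({id(v) for v in values})

shared = [[]] * 3                  # one list, referenced three times
separate = [[] for _ in range(3)]  # three independent lists

print(count_distinct_lists(shared))    # 1
print(count_distinct_lists(separate))  # 3

# Mutating one "row" of the shared version changes every row
shared[0].append("abc")
print(shared)  # [['abc'], ['abc'], ['abc']]
```

A result smaller than the number of rows means at least two rows share the same list.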