Add column of empty lists to DataFrame

Solution 1:

One more way is to use np.empty:

df['empty_list'] = np.empty((len(df), 0)).tolist()

You could also knock off .index in your "Method 1" when trying to find len of df.

df['empty_list'] = [[] for _ in range(len(df))]

Turns out, np.empty is faster...

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(pd.np.random.rand(1000000, 5))

In [3]: timeit df['empty1'] = pd.np.empty((len(df), 0)).tolist()
10 loops, best of 3: 127 ms per loop

In [4]: timeit df['empty2'] = [[] for _ in range(len(df))]
10 loops, best of 3: 193 ms per loop

In [5]: timeit df['empty3'] = df.apply(lambda x: [], axis=1)
1 loops, best of 3: 5.89 s per loop

Solution 2:

EDIT: the commenters caught the bug in my answer

s = pd.Series([[]] * 3)
s.iloc[0].append(1) #adding an item only to the first element
>s # unintended consequences:
0    [1]
1    [1]
2    [1]

So, the correct solution is

s = pd.Series([[] for i in range(3)])
s.iloc[0].append(1)
>s
0    [1]
1     []
2     []

OLD:

I timed all the three methods in the accepted answer, the fastest one took 216 ms on my machine. However, this took only 28 ms:

df['empty4'] = [[]] * len(df)

Note: Similarly, df['e5'] = [set()] * len(df) also took 28ms.

Solution 3:

Canonical solutions: List comprehension, map and apply

Obligatory disclaimer: avoid using lists in pandas columns where possible, list columns are slow to work with because they are objects and those are inherently hard to vectorize.

With that out of the way, here are the canonical methods of introducing a column of empty lists:

# List comprehension
df['c'] = [[] for _ in range(df.shape[0])]
df

   a  b   c
0  1  5  []
1  2  6  []
2  3  7  []

There's also these shorthands involving apply and map:

from collections import defaultdict
# map any column with defaultdict
df['c'] = df.iloc[:,0].map(defaultdict(list))
# same as,
df['c'] = df.iloc[:,0].map(lambda _: [])

# apply with defaultdict
df['c'] = df.apply(defaultdict(list), axis=1) 
# same as,
df['c'] = df.apply(lambda _: [], axis=1)

df

   a  b   c
0  1  5  []
1  2  6  []
2  3  7  []

Things you should NOT do

Some folks believe multiplying an empty list is the way to go, unfortunately this is wrong and will usually lead to hard-to-debug issues. Here's an MVP:

# WRONG
df['c'] = [[]] * len(df) 
df.at[0, 'c'].append('abc')
df.at[1, 'c'].append('def')
df

   a  b           c
0  1  5  [abc, def]
1  2  6  [abc, def]
2  3  7  [abc, def]

# RIGHT
df['c'] = [[] for _ in range(df.shape[0])]
df.at[0, 'c'].append('abc')
df.at[1, 'c'].append('def')
df

a  b      c
0  1  5  [abc]
1  2  6  [def]
2  3  7     []

In the first case, a single empty list is created and its reference is replicated across all the rows, so you see updates to one reflected to all of them. In the latter case each row is assigned its own empty list, so this is not a concern.