How to set the value of a pandas column as a list
I want to set the value of a pandas column to a list of strings. However, my attempts fail because pandas treats the value as an iterable to spread across the selected rows, and I get: ValueError: Must have equal len keys and value when setting with an iterable.
Here is an MWE:
>> import pandas as pd
>> df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>> df
col1 col2
0 1 4
1 2 5
2 3 6
>> df['new_col'] = None
>> df.loc[df.col1 == 1, 'new_col'] = ['a', 'b']
ValueError: Must have equal len keys and value when setting with an iterable
I tried to set the dtype to list using df.new_col = df.new_col.astype(list), and that didn't work either.
I am wondering what would be the correct approach here.
EDIT
The answer provided here (Python pandas insert list into a cell), which uses at, didn't work for me either.
It's not easy. One possible solution is to create a helper Series:
df.loc[df.col1 == 1, 'new_col'] = pd.Series([['a', 'b']] * len(df))
print(df)
col1 col2 new_col
0 1 4 [a, b]
1 2 5 NaN
2 3 6 NaN
Another solution, if you also need to set the missing values to an empty list, is to use a list comprehension:
# Alternative: keep missing values as NaN (needs `import numpy as np`)
# df['new_col'] = [['a', 'b'] if x == 1 else np.nan for x in df['col1']]
df['new_col'] = [['a', 'b'] if x == 1 else [] for x in df['col1']]
print(df)
col1 col2 new_col
0 1 4 [a, b]
1 2 5 []
2 3 6 []
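If you would rather keep the boolean mask from the question, a minimal sketch is to set one cell at a time with at, which stores the list as a single value instead of spreading it over rows. This assumes the index labels are unique and that new_col has object dtype; it is a Python-level loop, so it is not fast on large frames.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# Start from an object-dtype column so each cell can hold any Python object.
df['new_col'] = pd.Series([None] * len(df), index=df.index, dtype=object)

# Same condition as in the question.
mask = df['col1'] == 1

# .at addresses exactly one cell, so the list is stored whole.
for idx in df.index[mask]:
    df.at[idx, 'new_col'] = ['a', 'b']  # a fresh list per row, not a shared one

print(df)
```

Rows that do not match the mask keep their initial None value.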
Don't do this.
Pandas was never designed to hold lists in series / columns. You can concoct expensive workarounds, but these are not recommended.
The main reason holding lists in a series is not recommended is that you lose the vectorised functionality which goes with using NumPy arrays held in contiguous memory blocks. Your series will be of object dtype, which represents a sequence of pointers, much like list. You will lose benefits in terms of memory and performance, as well as access to optimized Pandas methods.
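To see the object dtype in practice, here is a small sketch. The series names are made up for illustration and are not part of the question.

```python
import pandas as pd

# Hypothetical example series, not taken from the question.
lists = pd.Series([['a', 'b'], [], []])
numbers = pd.Series([1, 2, 3])

print(lists.dtype)    # object: each cell is a pointer to a Python list
print(numbers.dtype)  # an integer dtype backed by a contiguous NumPy array

# Numeric columns support vectorised arithmetic directly.
doubled = numbers * 2

# For the list column, per-element work falls back to calling a Python function per cell.
lengths = lists.map(len)
```

Here doubled is computed in one vectorised operation, while lengths calls len once per cell.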
See also "What are the advantages of NumPy over regular Python lists?". The arguments in favour of Pandas are the same as for NumPy.
That said, even though this goes against the purpose and design of Pandas, many people face the same problem and have asked similar questions:
- Python pandas insert list into a cell
- pandas: how to store a list in a dataframe?
- Answer on this question
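If the lists only exist to attach several values to a row, one layout that stays within what Pandas is designed for is a long table with one row per value. This is only a sketch under that assumption; the tags frame and its tag column are hypothetical names, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# Hypothetical long-format table: one row per (col1, tag) pair
# instead of one list per cell.
tags = pd.DataFrame({'col1': [1, 1], 'tag': ['a', 'b']})

# A left merge keeps the rows that have no tags (their tag is NaN).
long_df = df.merge(tags, on='col1', how='left')

# Regular column operations such as groupby now work on the tag column.
counts = long_df.groupby('col1')['tag'].count()

print(long_df)
print(counts)
```

Whether this fits depends on how the lists are used later; it trades one wide row for several narrow ones.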