Pandas DataFrame stored list as string: How to convert back to list
I have an n-by-m Pandas DataFrame df
defined as follows. (I know this is not the best way to do it. It makes sense for what I'm trying to do in my actual code, but that would be TMI for this post so just take my word that this approach works in my particular scenario.)
>>> df = DataFrame(columns=['col1'])
>>> df.append(Series([None]), ignore_index=True)
>>> df
Empty DataFrame
Columns: [col1]
Index: []
I stored lists in the cells of this DataFrame as follows.
>>> df['column1'][0] = [1.23, 2.34]
>>> df
col1
0 [1, 2]
For some reason, the DataFrame stored this list as a string instead of a list.
>>> df['column1'][0]
'[1.23, 2.34]'
I have 2 questions for you.
- Why does the DataFrame store a list as a string and is there a way around this behavior?
- If not, then is there a Pythonic way to convert this string into a list?
Update
The DataFrame I was using had been saved and loaded from a CSV format. This format, rather than the DataFrame itself, converted the list from a string to a literal.
As you pointed out, this can commonly happen when saving and loading pandas DataFrames as .csv
files, which is a text format.
In your case this happened because list objects have a string representation, allowing them to be stored as .csv
files. Loading the .csv
will then yield that string representation.
If you want to store the actual objects, you should use DataFrame.to_pickle()
(note: objects must be picklable!).
To answer your second question, you can convert it back with ast.literal_eval
:
>>> from ast import literal_eval
>>> literal_eval('[1.23, 2.34]')
[1.23, 2.34]
You can directly use pandas -
import pandas as pd
df = pd.read_csv(DF_NAME, converters={'COLUMN_NAME': pd.eval})
This will read that column as a it's corresponding dtype in python instead of a string.
UPDATE:
As rightfully pointed out by @ctwardy in comments. It is wiser to use pd.eval
instead of eval
to avoid unintended regex related consequences. Details - https://realpython.com/python-eval-function/#minimizing-the-security-issues-of-eval
- Use
ast.literal_eval
to safely evaluate a string containing a Python literal or container datatype.-
It's part of the standard library
-
Using python's eval() vs. ast.literal_eval()? explains why
literal_eval
is safer than usingeval
. -
Examples:
-
literal_eval("[1.23, 2.34]")
works -
literal_eval("['KB4523205','KB4519569','KB4503308']")
works- Other answers mention
pd.eval
, but its usage is limited; it results inValueError: NumExpr 2 does not support Unicode as a dtype.
for this simple example.
- Other answers mention
-
literal_eval("[KB4523205, KB4519569, KB4503308]")
doesn't work (no quotes around thestr
values)- See pandas - convert string into list of strings for dealing with this representation.
-
-
- Convert the column when reading the file, by using the
converters
parameter ofpandas.read_csv
.
data in test.csv
col1
"[1.23, 2.34]"
"['KB4523205','KB4519569','KB4503308']"
Convert the column when creating the csv
from ast import literal_eval
import pandas as pd
# convert the column during import
df = pd.read_csv('test.csv', converters={'col1': literal_eval})
# display(df)
col1
0 [1.23, 2.34]
1 [KB4523205, KB4519569, KB4503308]
# check type
print(type(df.iloc[0, 0]))
list
print(type(df.iloc[1, 0]))
list
Convert the column of an existing dataframe
df.col1 = df.col1.apply(literal_eval)
%%timeit
-
pd.eval
is 28 times slower thanliteral_eval
- Given
test.csv
with 2,820,511 rows of"[1.23, 2.34]"
I just came across this problem and there is a very simple solution (pandas.eval()). I'm using pandas 0.20.0.
# SETUP
import pandas as pd
import io
csv = io.StringIO(u'''
id list
A1 [1,2]
A2 [3,4]
A3 [5,6]
''')
df = pd.read_csv(csv, delim_whitespace = True)
# TYPE CHECK <type 'str'>
print type(df.at[0, 'list'])
# MAIN CONVERSION
df['list'] = pd.eval(df['list'])
# TYPE CHECK <type 'list'>
print type(df.at[0, 'list'])
1) There is a way around this behavior. Use loc helps here.
>>> import pandas as pd
>>> df = pd.DataFrame(columns=['column1'])
>>> df = df.append(pd.Series(data = {'column1':[None]}), ignore_index = True)
column1
0 [None]
>>> # Add list to index 0 in column1
>>> df.loc[0,'column1'] = [1.23, 2.34]
>>> print(df.loc[0, 'column1'])
[1.23, 2.34]
2) Pythonic way to convert this string into a list. (This is probably what you want as the DataFrame you are using had been been saved and loaded from a CSV format, there are a couple solutions for this). This is an addition on pshep123's answer.
from ast import literal_eval
import pandas as pd
csv = io.StringIO(u'''
id list
A1 [1,2]
A2 [3,4]
A3 [5,6]
''')
df = pd.read_csv(csv, delim_whitespace = True)
# Output is a string
df.loc[0, 'list']
'[1,2]'
# Convert entire column to a list
df.loc[:,'list'] = df.loc[:,'list'].apply(lambda x: literal_eval(x))
# Output is a list
df.loc[0, 'list']
[1, 2]