Pythonic/efficient way to strip whitespace from every pandas DataFrame cell that contains a string-like object
I stumbled onto this question while looking for a quick, minimal snippet I could use, and had to assemble one myself from the posts above. Maybe someone will find it useful:
data_frame_trimmed = data_frame.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
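To see what this one-liner does on a column that mixes strings and numbers, here is a minimal sketch. The sample frame and its column names are made up for illustration, and dtype=object is forced so that the dtype check matches regardless of how a given pandas version stores strings by default.

```python
import pandas as pd

# Hypothetical sample data; the columns are forced to object dtype on purpose.
data_frame = pd.DataFrame({
    "name": pd.Series([" ABC ", " DEF", "GHI "], dtype=object),
    "mixed": pd.Series([" x ", 1, " y"], dtype=object),
    "count": [1, 2, 3],
})

# Same one-liner as above: strip object columns, leave other columns alone.
data_frame_trimmed = data_frame.apply(
    lambda x: x.str.strip() if x.dtype == "object" else x
)

print(data_frame_trimmed["name"].tolist())   # all-string column is stripped
print(data_frame_trimmed["mixed"].tolist())  # the non-string entry becomes NaN
```

Note that .str.strip() returns NaN for elements that are not strings, so this approach is best suited to object columns that hold only strings.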
You could use pandas' Series.str.strip() method to do this quickly for each string-like column:
>>> data = pd.DataFrame({'values': [' ABC ', ' DEF', ' GHI ']})
>>> data
values
0 ABC
1 DEF
2 GHI
>>> data['values'].str.strip()
0 ABC
1 DEF
2 GHI
Name: values, dtype: object
We want to:
- Apply our function to each element in our dataframe - use applymap.
- Use type(x)==str (versus x.dtype == 'object') because Pandas will label columns as object for columns of mixed datatypes (an object column may contain int and/or str).
- Maintain the datatype of each element (we don't want to convert everything to a str and then strip whitespace).
Therefore, I've found the following to be the easiest:
df.applymap(lambda x: x.strip() if type(x)==str else x)
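A minimal sketch of the same idea on a frame with mixed columns, to show that non-string values pass through untouched. The sample data and the helper name strip_if_str are made up for illustration. Two things differ from the line above: it uses isinstance(x, str) instead of type(x)==str (which also accepts str subclasses), and it picks DataFrame.map when available, since DataFrame.applymap was deprecated in pandas 2.1 in favour of DataFrame.map.

```python
import pandas as pd

# Hypothetical sample data with strings, numbers and a missing value mixed together.
df = pd.DataFrame({"a": [" x ", 1, None], "b": [" y", "z ", 2.5]})


def strip_if_str(value):
    """Strip whitespace from strings; return anything else unchanged."""
    return value.strip() if isinstance(value, str) else value


# Use DataFrame.map on newer pandas, fall back to applymap on older versions.
elementwise = df.map if hasattr(df, "map") else df.applymap
cleaned = elementwise(strip_if_str)

print(cleaned)
```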
When you call pandas.read_csv, you can use a regular expression that matches zero or more spaces followed by a comma followed by zero or more spaces as the delimiter.
For example, here's "data.csv":
In [19]: !cat data.csv
1.5, aaa, bbb , ddd , 10 , XXX
2.5, eee, fff , ggg, 20 , YYY
(The first line ends with three spaces after XXX, while the second line ends at the last Y.)
The following uses pandas.read_csv() to read the file, with the regular expression ' *, *' as the delimiter. (Using a regular expression as the delimiter is only available in the "python" engine of read_csv().)
In [20]: import pandas as pd
In [21]: df = pd.read_csv('data.csv', header=None, delimiter=' *, *', engine='python')
In [22]: df
Out[22]:
0 1 2 3 4 5
0 1.5 aaa bbb ddd 10 XXX
1 2.5 eee fff ggg 20 YYY
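If you want to try this without creating a file, here is a self-contained sketch that feeds similar text through io.StringIO. The exact spacing in csv_text is made up for illustration. Since the delimiter only consumes spaces that sit next to a comma, spaces at the very end of a line are not covered by it, so the check on the last column strips the value first rather than assuming how it comes back.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for data.csv; the spacing is invented.
csv_text = (
    "1.5, aaa,  bbb ,  ddd     , 10 ,  XXX   \n"
    "2.5, eee, fff  ,       ggg, 20 ,     YYY\n"
)

# A regex delimiter requires the "python" engine.
df = pd.read_csv(io.StringIO(csv_text), header=None, sep=" *, *", engine="python")

print(df)
print(repr(df.iloc[0, 1]))  # inner fields come back without surrounding spaces
```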