Pythonic/efficient way to strip whitespace from every Pandas DataFrame cell that has a string-like object in it

I stumbled onto this question while looking for a quick and minimal snippet I could use, and had to assemble one myself from the posts above. Maybe someone will find it useful:

# Strip whitespace in every object-dtype column; leave other columns untouched.
data_frame_trimmed = data_frame.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
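
One caveat worth knowing: .str.strip() maps any non-string element to NaN, so on a genuinely mixed object column this one-liner would quietly turn numbers into NaN. A minimal demonstration (the sample DataFrame here is made up for illustration):

import pandas as pd

df = pd.DataFrame({"text": ["  x  ", " y"], "num": [1, 2]})

# The object-dtype column is stripped; the integer column passes through.
trimmed = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)

# Caveat: a mixed object column such as pd.Series(["  x  ", 1]) would come
# back as ["x", NaN] from .str.strip().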

You could use pandas' Series.str.strip() method to do this quickly for each string-like column:

>>> data = pd.DataFrame({'values': ['   ABC   ', '   DEF', '  GHI  ']})
>>> data
      values
0     ABC   
1        DEF
2      GHI  

>>> data['values'].str.strip()
0    ABC
1    DEF
2    GHI
Name: values, dtype: object
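
To clean every string-like column of a DataFrame at once, you could combine this with select_dtypes; a small sketch, not part of the original answer:

import pandas as pd

data = pd.DataFrame({'values': ['   ABC   ', '   DEF', '  GHI  ']})

# Pick out the object-dtype (string-like) columns and strip them in place.
str_cols = data.select_dtypes(include="object").columns
data[str_cols] = data[str_cols].apply(lambda s: s.str.strip())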

We want to:

  1. Apply our function to each element in our DataFrame using applymap.

  2. Check each element's type with isinstance(x, str) (rather than testing x.dtype == 'object'), because pandas labels columns of mixed datatypes as object (an object column may contain both int and str values).

  3. Maintain the datatype of each element (we don't want to convert everything to a str and then strip whitespace).

Therefore, I've found the following to be the easiest:

# Strip whitespace from string elements; leave everything else untouched.
df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
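
Note that applymap was deprecated in pandas 2.1 in favor of DataFrame.map, which takes the same element-wise function. On a recent pandas, the equivalent call would be:

# Same element-wise strip, using the newer DataFrame.map (pandas >= 2.1).
df.map(lambda x: x.strip() if isinstance(x, str) else x)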


When you call pandas.read_csv, you can use a regular expression that matches zero or more spaces followed by a comma followed by zero or more spaces as the delimiter.

For example, here's "data.csv":

In [19]: !cat data.csv
1.5, aaa,  bbb ,  ddd     , 10 ,  XXX   
2.5, eee, fff  ,       ggg, 20 ,     YYY

(The first line ends with three spaces after XXX, while the second line ends at the last Y.)

The following uses pandas.read_csv() to read the file, with the regular expression ' *, *' as the delimiter. (Using a regular expression as the delimiter is only available in the "python" engine of read_csv().)

In [20]: import pandas as pd

In [21]: df = pd.read_csv('data.csv', header=None, delimiter=' *, *', engine='python')

In [22]: df
Out[22]: 
     0    1    2    3   4    5
0  1.5  aaa  bbb  ddd  10  XXX
1  2.5  eee  fff  ggg  20  YYY
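
If you would rather keep the faster default C engine, a related option (not from the original answer) is skipinitialspace=True, which drops the spaces that follow each delimiter. Spaces before a delimiter survive, so a follow-up strip is still needed. A sketch, assuming the same data.csv:

import pandas as pd

# skipinitialspace removes spaces *after* each comma, so the C engine suffices.
df = pd.read_csv('data.csv', header=None, skipinitialspace=True)

# Spaces *before* a comma (e.g. "bbb ") remain, so strip the object columns.
df = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)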