Python: Pandas filter string data based on its string length

I would like to filter out rows whose string length is not equal to 10.

To filter out any row where the string length of column A or B is not equal to 10, I tried this:

import numpy as np
import pandas as pd

df=pd.read_csv('filex.csv')
df.A=df.A.apply(lambda x: x if len(x)== 10 else np.nan)
df.B=df.B.apply(lambda x: x if len(x)== 10 else np.nan)
df=df.dropna(subset=['A','B'], how='any')

This is slow, but it works.

However, it sometimes produces an error when the data in A is not a string but a number (interpreted as a number when read_csv reads the input file).

  File "<stdin>", line 1, in <lambda>
TypeError: object of type 'float' has no len()
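
One way to reproduce the error is with an empty field rather than a numeric one. This is a minimal sketch that assumes a small in-memory CSV (a hypothetical stand-in for filex.csv) with a missing value in column A. read_csv reads the empty field as NaN, which is a float, so len() fails on it:

```python
import io

import pandas as pd

# Hypothetical stand-in for filex.csv: the second data row has an empty A field.
csv_text = "A,B\nabcdefghij,klmnopqrst\n,abcdefghij\n"
df = pd.read_csv(io.StringIO(csv_text))

# The empty field is read as NaN, which is a float, not a string.
missing = df.loc[1, "A"]
print(type(missing))

try:
    len(missing)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```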

I believe there should be a more efficient and elegant way to do this.

Based on the answers and comments below, the simplest solutions I found are:

df=df[df.A.apply(lambda x: len(str(x))==10)]
df=df[df.B.apply(lambda x: len(str(x))==10)]

df=df[(df.A.apply(lambda x: len(str(x))==10)) & (df.B.apply(lambda x: len(str(x))==10))]

df=df[(df.A.astype(str).str.len()==10) & (df.B.astype(str).str.len()==10)]

import pandas as pd

df = pd.read_csv('filex.csv')
df['A'] = df['A'].astype('str')
df['B'] = df['B'].astype('str')
mask = (df['A'].str.len() == 10) & (df['B'].str.len() == 10)
df = df.loc[mask]
print(df)

Applied to filex.csv:

A,B
123,abc
1234,abcd
1234567890,abcdefghij

the code above prints

            A           B
2  1234567890  abcdefghij
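
The same idea can be wrapped in a small helper that takes any list of columns. This is only a sketch: the function name filter_by_length is made up for illustration, and the CSV is read from an in-memory string with the sample data above so the example runs without a file on disk.

```python
import io

import pandas as pd


def filter_by_length(df, columns, length):
    """Keep rows where every listed column has exactly `length` characters.

    Hypothetical helper, not part of pandas.
    """
    mask = pd.Series(True, index=df.index)
    for col in columns:
        # astype(str) makes numeric columns safe for the .str accessor.
        mask &= df[col].astype(str).str.len() == length
    return df.loc[mask]


# Same rows as the sample filex.csv above.
csv_text = "A,B\n123,abc\n1234,abcd\n1234567890,abcdefghij\n"
df = pd.read_csv(io.StringIO(csv_text))

result = filter_by_length(df, ["A", "B"], 10)
print(result)
```

Unlike the code above, this version converts to strings only to build the mask, so column A keeps its original numeric dtype in the result.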

A more Pythonic way of filtering out rows based on given conditions of other columns and their values:

Assuming a df of:

data={"names":["Alice","Zac","Anna","O"],"cars":["Civic","BMW","Mitsubishi","Benz"],
     "age":["1","4","2","0"]}

df=pd.DataFrame(data)
df:
  age        cars  names
0   1       Civic  Alice
1   4         BMW    Zac
2   2  Mitsubishi   Anna
3   0        Benz      O

Then:

df[
    df['names'].apply(lambda x: len(x)>1) &
    df['cars'].apply(lambda x: "i" in x) &
    df['age'].apply(lambda x: int(x)<2)
]

We get:

  age   cars  names
0   1  Civic  Alice

The conditions above first check the length of the strings in names, then check whether the letter "i" appears in the strings in cars, and finally check the integer value of age (the first column in the output).
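
The same three conditions can also be written without apply and lambdas, using the vectorized .str accessor and astype. This is a sketch of an alternative, not the approach used in this answer:

```python
import pandas as pd

data = {"names": ["Alice", "Zac", "Anna", "O"],
        "cars": ["Civic", "BMW", "Mitsubishi", "Benz"],
        "age": ["1", "4", "2", "0"]}
df = pd.DataFrame(data)

mask = (
    (df["names"].str.len() > 1)                   # name longer than one character
    & df["cars"].str.contains("i", regex=False)   # literal "i" somewhere in the car name
    & (df["age"].astype(int) < 2)                 # age stored as a string, compared as an integer
)
filtered = df[mask]
print(filtered)
```

The parentheses around each comparison matter, because & binds more tightly than > and < in Python.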

I personally found this way to be the easiest:

df = df[df['column_name'].str.len() == 10]

If a column contains numbers, read_csv may parse them as numeric values such as floats instead of strings.

Convert all the values to strings after importing from the CSV. For better performance, split the lambdas across multiple threads.
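
An alternative to converting after the import is to ask read_csv for strings up front with dtype=str. The sketch below assumes a small in-memory CSV (hypothetical data, not the asker's file) that contains an empty field. With dtype=str the numbers stay as text, missing fields stay NaN, and comparing the result of .str.len() against 10 is False for those missing values, so they are dropped by the mask:

```python
import io

import pandas as pd

# Hypothetical data: the second row has an empty A field.
csv_text = "A,B\n123,abc\n,abcdefghij\n1234567890,abcdefghij\n"

# dtype=str keeps every column as text instead of inferring numbers.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Missing values give NaN from .str.len(), and NaN == 10 is False.
mask = (df["A"].str.len() == 10) & (df["B"].str.len() == 10)
filtered = df[mask]
print(filtered)
```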
