Efficiently drop rows in a Pandas DataFrame, where you need to tokenize the text of a column first in order to test a condition
I have a CSV file of around 2 gigabytes that I have loaded into a Pandas DataFrame called data. Whether a row should be removed depends on the text held in a column called doc_info. More specifically, I want to remove the rows whose text in the doc_info column has fewer than 20 words.
The code that I've used is the following:
for index, row in data.iterrows():
    tokenized_doc_info = row.doc_info.split()
    if len(tokenized_doc_info) < 20:
        data.drop(index, inplace=True)
However, the above code could not complete even after 7 hours, so I interrupted it. Could you provide me with a better solution, or explain to me why this code is so slow?
Thank you
You almost never want to iterate over a pandas DataFrame row by row, because the common operations have C-optimized, vectorized counterparts. On top of that, data.drop(index, inplace=True) rebuilds the DataFrame on every call, so dropping rows one at a time inside the loop makes the whole thing roughly quadratic in the number of rows. Use the built-in string methods instead:
data[data.doc_info.str.split().str.len() >= 20]
This keeps the sub-dataframe of records where doc_info has at least 20 words (as defined by whitespace separation), which is the same as dropping the rows with fewer than 20. It should be drastically faster.
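A minimal, self-contained sketch of this approach (the toy data and the min_words variable are made up here for illustration):

    import pandas as pd

    # Toy stand-in for the 2 GB file; only the doc_info column matters here.
    data = pd.DataFrame({
        "doc_info": [
            "just two",               # 2 words  -> dropped
            " ".join(["word"] * 25),  # 25 words -> kept
        ]
    })

    min_words = 20
    # Vectorized word count: split each cell on whitespace, then take the list length.
    mask = data.doc_info.str.split().str.len() >= min_words
    data = data[mask]
    print(data)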
Let us try
out = data[data.doc_info.str.split().str.len() >= 20]
Or
out = data[data.doc_info.str.count(' ') >= 20 - 1]

(A text with n single-space-separated words contains n - 1 spaces, hence the 20 - 1.)
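One caveat worth noting: the space-count shortcut only agrees with str.split() when words are separated by exactly one space. A contrived example (made up here) showing where the two diverge:

    import pandas as pd

    data = pd.DataFrame({"doc_info": ["a  b", "a b c"]})  # first cell has a double space

    print(data.doc_info.str.split().str.len())  # 2, 3 -> true word counts
    print(data.doc_info.str.count(' ') + 1)     # 3, 3 -> the double space inflates the first count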