Efficiently drop rows in a Pandas DataFrame, where you need to tokenize the text of a column first in order to test a condition
I have a CSV file of around 2 gigabytes that I have loaded into a Pandas DataFrame called data. Whether a row should be removed depends on the text held in a column called doc_info. More specifically, I want to remove the rows whose text in the doc_info column has fewer than 20 words.
The code that I've used is the following:
for index, row in data.iterrows():
    tokenized_doc_info = row.doc_info.split()
    if len(tokenized_doc_info) < 20:
        data.drop(index, inplace=True)
However, the above code could not complete even after 7 hours, so I interrupted it. Could you provide me with a better solution, or explain to me why this code is so slow?
Thank you
You almost never want to iterate over a pandas DataFrame row by row, because the common operations have C-optimized, vectorized counterparts. On top of that, data.drop(index, inplace=True) rebuilds the DataFrame on every call, so dropping rows one at a time inside the loop makes the whole thing roughly quadratic in the number of rows. Use the built-in string methods instead:
data[data.doc_info.str.split().str.len() >= 20]
This keeps the sub-dataframe of records where doc_info has at least 20 words (as defined by whitespace separation), which is the same as dropping the rows with fewer than 20. It should be drastically faster.
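A minimal, self-contained sketch of this approach (the toy data and the min_words variable are made up here for illustration):

    import pandas as pd

    # Toy stand-in for the 2 GB file; only the doc_info column matters here.
    data = pd.DataFrame({
        "doc_info": [
            "just two",               # 2 words  -> dropped
            " ".join(["word"] * 25),  # 25 words -> kept
        ]
    })

    min_words = 20
    # Vectorized word count: split each cell on whitespace, then take the list length.
    mask = data.doc_info.str.split().str.len() >= min_words
    data = data[mask]
    print(data)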
Let us try
out = data[data.doc_info.str.split().str.len() >= 20]
Or
out = data[data.doc_info.str.count(' ') >= 20 - 1]

(A text with n single-space-separated words contains n - 1 spaces, hence the 20 - 1.)
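One caveat worth noting: the space-count shortcut only agrees with str.split() when words are separated by exactly one space. A contrived example (made up here) showing where the two diverge:

    import pandas as pd

    data = pd.DataFrame({"doc_info": ["a  b", "a b c"]})  # first cell has a double space

    print(data.doc_info.str.split().str.len())  # 2, 3 -> true word counts
    print(data.doc_info.str.count(' ') + 1)     # 3, 3 -> the double space inflates the first count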