Multiprocessing and Threading in Python
I'm trying to use multiprocessing in Python, but I think I may not have understood it properly.
To start with, I have a dataframe containing text as strings, on which I want to perform some regex substitutions. The code looks as follows:
    import os
    import re
    from threading import Thread

    import multiprocess

    # `data` is a pandas DataFrame with a string column "qa", loaded elsewhere.

    def clean_qa():
        # Remove dashed separators, short bracketed notes and punctuation, row by row.
        for index, row in data.iterrows():
            data["qa"].loc[index] = re.sub(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]", "", str(data["qa"].loc[index]))

    # Variant 1: threading
    if __name__ == '__main__':
        threads = []
        for i in range(os.cpu_count()):
            threads.append(Thread(target=clean_qa))
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()

    # Variant 2: multiprocessing
    if __name__ == '__main__':
        processes = []
        for i in range(os.cpu_count()):
            processes.append(multiprocess.Process(target=clean_qa))
        for process in processes:
            process.start()
        for process in processes:
            process.join()
When I run the body of "clean_qa" directly, i.e. just the for loop rather than the function, everything works fine and it takes about 3 minutes.
However, when I use multiprocessing or threading, the execution takes about 10 minutes and the text is not cleaned, so the dataframe is the same as before.
Hence my question: what did I do wrong, why does it take longer, and why does nothing happen to the dataframe?
Thank you very much!
This is slightly beside the point (though my comments in the original post do address the actual points), but since you're working with a Pandas dataframe, you really never want to loop over it by hand.
Looks like all you actually want here is just:
    import re

    r = re.compile(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]")
    def clean_qa():
        # regex=True is needed for a compiled pattern in pandas 2.0 and later.
        data["qa"] = data["qa"].str.replace(r, "", regex=True)
to let Pandas deal with the looping. (Pandas does not run this in parallel, but it avoids the per-row overhead of iterrows and .loc assignment.)
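If you still want to spread the work over several processes, here is a minimal sketch of how that could look. As far as I can tell, the code in the question has two problems: every thread or process runs the full loop over all rows, so the total work is multiplied rather than divided, and each child process works on its own copy of the dataframe, so its changes never reach the parent. The sketch below hands each worker a batch of strings and collects the cleaned strings that come back. The names `clean_text` and `clean_parallel`, and the defaults for `n_workers` and `chunksize`, are my own and purely illustrative. It uses the standard-library `multiprocessing` module rather than the `multiprocess` package from the question.

```python
import re
from multiprocessing import Pool

# Same pattern as in the question, compiled once per process at import time.
PATTERN = re.compile(r"(\-{5,}).{1,100}(\-{5,})|(\[.{1,50}\])|[^\w\s]")


def clean_text(text):
    """Return one cleaned string; takes input and returns output, no shared state."""
    return PATTERN.sub("", str(text))


def clean_parallel(texts, n_workers=2, chunksize=1000):
    """Clean a list of strings in worker processes and return the results in order.

    n_workers and chunksize are illustrative defaults, not tuned values.
    """
    with Pool(n_workers) as pool:
        # Each worker receives batches of strings and sends the cleaned batches back.
        # Nothing is modified in place, so the parent must use the return value.
        return pool.map(clean_text, texts, chunksize=chunksize)


if __name__ == "__main__":
    sample = ["Hello, world!", "[note] keep this", "----- header ----- text"]
    print(clean_parallel(sample))
```

With the dataframe from the question this would be used as `data["qa"] = clean_parallel(data["qa"].tolist())`. Whether it beats the plain `str.replace` version depends on the data size, since starting processes and pickling the strings back and forth has its own cost, so it is worth timing both.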