spacy with joblib library generates _pickle.PicklingError: Could not pickle the task to send it to the workers
I have a large list of sentences (~7 millions), and I want to extract the nouns from them.
I used joblib
library to parallelize the extracting process, like in the following:
import spacy
from tqdm import tqdm
from joblib import Parallel, delayed
nlp = spacy.load('en_core_web_sm')
class nouns:
def get_nouns(self, text):
doc = nlp(u"{}".format(text))
return [token.text for token in doc if token.tag_ in ['NN', 'NNP', 'NNS', 'NNPS']]
def parallelize(self, sentences):
results = Parallel(n_jobs=1)(delayed(self.get_nouns)(sent) for sent in tqdm(sentences))
return results
if __name__ == '__main__':
sentences = ['we went to the school yesterday',
'The weather is really cold',
'Can we catch the dog?',
'How old are you John?',
'I like diving and swimming',
'Can the world become united?']
obj = nouns()
print(obj.parallelize(sentences))
when n_jobs
in parallelize function is more than 1, I get this long error:
100%|██████████| 6/6 [00:00<00:00, 200.00it/s]
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\queues.py", line 150, in _feed
obj_ = dumps(obj, reducers=reducers)
File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\reduction.py", line 243, in dumps
dump(obj, buf, reducers=reducers, protocol=protocol)
File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\reduction.py", line 236, in dump
_LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 267, in dump
return Pickler.dump(self, obj)
File "C:\Python35\lib\pickle.py", line 408, in dump
self.save(obj)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 841, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 770, in save_list
self._batch_appends(obj)
File "C:\Python35\lib\pickle.py", line 797, in _batch_appends
save(tmp[0])
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 725, in save_tuple
save(element)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 718, in save_instancemethod
self.save_reduce(types.MethodType, (obj.__func__, obj.__self__), obj=obj)
File "C:\Python35\lib\pickle.py", line 599, in save_reduce
save(args)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 725, in save_tuple
save(element)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 395, in save_function
self.save_function_tuple(obj)
File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 594, in save_function_tuple
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 841, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 599, in save_reduce
save(args)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 740, in save_tuple
save(element)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 740, in save_tuple
save(element)
File "C:\Python35\lib\pickle.py", line 495, in save
rv = reduce(self.proto)
File "stringsource", line 2, in preshed.maps.PreshMap.__reduce_cython__
TypeError: self.c_map cannot be converted to a Python object for pickling
"""Exception in thread QueueFeederThread:
Traceback (most recent call last):
File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\queues.py", line 150, in _feed
obj_ = dumps(obj, reducers=reducers)
File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\reduction.py", line 243, in dumps
dump(obj, buf, reducers=reducers, protocol=protocol)
File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\reduction.py", line 236, in dump
_LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 267, in dump
return Pickler.dump(self, obj)
File "C:\Python35\lib\pickle.py", line 408, in dump
self.save(obj)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 841, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 770, in save_list
self._batch_appends(obj)
File "C:\Python35\lib\pickle.py", line 797, in _batch_appends
save(tmp[0])
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 725, in save_tuple
save(element)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 718, in save_instancemethod
self.save_reduce(types.MethodType, (obj.__func__, obj.__self__), obj=obj)
File "C:\Python35\lib\pickle.py", line 599, in save_reduce
save(args)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 725, in save_tuple
save(element)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 395, in save_function
self.save_function_tuple(obj)
File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 594, in save_function_tuple
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 841, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 599, in save_reduce
save(args)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 740, in save_tuple
save(element)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 740, in save_tuple
save(element)
File "C:\Python35\lib\pickle.py", line 495, in save
rv = reduce(self.proto)
File "stringsource", line 2, in preshed.maps.PreshMap.__reduce_cython__
TypeError: self.c_map cannot be converted to a Python object for pickling
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python35\lib\threading.py", line 914, in _bootstrap_inner
self.run()
File "C:\Python35\lib\threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\queues.py", line 175, in _feed
onerror(e, obj)
File "C:\Python35\lib\site-packages\joblib\externals\loky\process_executor.py", line 310, in _on_queue_feeder_error
self.thread_wakeup.wakeup()
File "C:\Python35\lib\site-packages\joblib\externals\loky\process_executor.py", line 155, in wakeup
self._writer.send_bytes(b"")
File "C:\Python35\lib\multiprocessing\connection.py", line 183, in send_bytes
self._check_closed()
File "C:\Python35\lib\multiprocessing\connection.py", line 136, in _check_closed
raise OSError("handle is closed")
OSError: handle is closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File ".../playground.py", line 43, in <module>
print(obj.Paralize(sentences))
File ".../playground.py", line 32, in Paralize
results = Parallel(n_jobs=2)(delayed(self.get_nouns)(sent) for sent in tqdm(sentences))
File "C:\Python35\lib\site-packages\joblib\parallel.py", line 934, in __call__
self.retrieve()
File "C:\Python35\lib\site-packages\joblib\parallel.py", line 833, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "C:\Python35\lib\site-packages\joblib\_parallel_backends.py", line 521, in wrap_future_result
return future.result(timeout=timeout)
File "C:\Python35\lib\concurrent\futures\_base.py", line 405, in result
return self.__get_result()
File "C:\Python35\lib\concurrent\futures\_base.py", line 357, in __get_result
raise self._exception
_pickle.PicklingError: Could not pickle the task to send it to the workers.
What is the problem in my code?
Solution 1:
Q: What is the problem in my code?
Well, most probably the issue comes not from the code, but from the "hidden" processing, that appears, once n_jobs
directs ( and joblib
internally orchestrates ) to prepare that many exact copies of the main process, so as to let them work independently one of each other ( effectively thus escaping from GIL-locking and mapping the multiple process-flows onto physical hardware resources )
This step is responsible for making copies of all pythonic objects and was known to use Pickle
for doing this. The Pickle
module was known for its historical principal limitations on what can be pickled and what cannot.
The error message confirms this:
TypeError: self.c_map cannot be converted to a Python object for pickling
One may try a trick to supply Mike McKearns dill
module instead of Pickle
and test, if your "problematic" python objects will get pickled with this module without throwing this error.
dill
has the same API signatures, so a pure import dill as pickle
may help with leaving all the other code the same.
I had the same problems, with large models to get distributed into and back from multiple processes and the dill
was a way to go. Also the performance has increased.
Bonus:
dill
allows to save / restore the full python interpreter state!
This was a cool side-effect of finding dill
, once import dill as pickle
was done, pickle.dump_session( <aFile> )
will save ones complete state-full copy of the python interpreter session. This can be restored, if needed ( post-crash restores, trained trained and optimised ML-model state-fully saved / restored, incremental learning ML-model state-fully saved and re-distributed for remote restores for the deployed user-bases, etc. )
Solution 2:
Same issue. I solved by changing the backend from loky
to threading
in Parallel
.