spaCy with joblib raises _pickle.PicklingError: Could not pickle the task to send it to the workers

I have a large list of sentences (~7 million), and I want to extract the nouns from them.

I used the joblib library to parallelize the extraction, as follows:

import spacy
from tqdm import tqdm
from joblib import Parallel, delayed
nlp = spacy.load('en_core_web_sm')

class nouns:

    def get_nouns(self, text):
        doc = nlp(u"{}".format(text))
        return [token.text for token in doc if token.tag_ in ['NN', 'NNP', 'NNS', 'NNPS']]

    def parallelize(self, sentences):
        results = Parallel(n_jobs=1)(delayed(self.get_nouns)(sent) for sent in tqdm(sentences))
        return results

if __name__ == '__main__':
    sentences = ['we went to the school yesterday',
                 'The weather is really cold',
                 'Can we catch the dog?',
                 'How old are you John?',
                 'I like diving and swimming',
                 'Can the world become united?']
    obj = nouns()
    print(obj.parallelize(sentences))

When n_jobs in the parallelize function is greater than 1, I get this long error:

100%|██████████| 6/6 [00:00<00:00, 200.00it/s]
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\queues.py", line 150, in _feed
    obj_ = dumps(obj, reducers=reducers)
  File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\reduction.py", line 243, in dumps
    dump(obj, buf, reducers=reducers, protocol=protocol)
  File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\reduction.py", line 236, in dump
    _LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
  File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 267, in dump
    return Pickler.dump(self, obj)
  File "C:\Python35\lib\pickle.py", line 408, in dump
    self.save(obj)
  File "C:\Python35\lib\pickle.py", line 520, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Python35\lib\pickle.py", line 623, in save_reduce
    save(state)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 810, in save_dict
    self._batch_setitems(obj.items())
  File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
    save(v)
  File "C:\Python35\lib\pickle.py", line 520, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Python35\lib\pickle.py", line 623, in save_reduce
    save(state)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 810, in save_dict
    self._batch_setitems(obj.items())
  File "C:\Python35\lib\pickle.py", line 841, in _batch_setitems
    save(v)
  File "C:\Python35\lib\pickle.py", line 520, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Python35\lib\pickle.py", line 623, in save_reduce
    save(state)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 810, in save_dict
    self._batch_setitems(obj.items())
  File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
    save(v)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 770, in save_list
    self._batch_appends(obj)
  File "C:\Python35\lib\pickle.py", line 797, in _batch_appends
    save(tmp[0])
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 725, in save_tuple
    save(element)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 718, in save_instancemethod
    self.save_reduce(types.MethodType, (obj.__func__, obj.__self__), obj=obj)
  File "C:\Python35\lib\pickle.py", line 599, in save_reduce
    save(args)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 725, in save_tuple
    save(element)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 395, in save_function
    self.save_function_tuple(obj)
  File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 594, in save_function_tuple
    save(state)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 810, in save_dict
    self._batch_setitems(obj.items())
  File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
    save(v)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 810, in save_dict
    self._batch_setitems(obj.items())
  File "C:\Python35\lib\pickle.py", line 841, in _batch_setitems
    save(v)
  File "C:\Python35\lib\pickle.py", line 520, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Python35\lib\pickle.py", line 623, in save_reduce
    save(state)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 810, in save_dict
    self._batch_setitems(obj.items())
  File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
    save(v)
  File "C:\Python35\lib\pickle.py", line 520, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Python35\lib\pickle.py", line 599, in save_reduce
    save(args)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 740, in save_tuple
    save(element)
  File "C:\Python35\lib\pickle.py", line 520, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Python35\lib\pickle.py", line 623, in save_reduce
    save(state)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 740, in save_tuple
    save(element)
  File "C:\Python35\lib\pickle.py", line 495, in save
    rv = reduce(self.proto)
  File "stringsource", line 2, in preshed.maps.PreshMap.__reduce_cython__
TypeError: self.c_map cannot be converted to a Python object for pickling
"""Exception in thread QueueFeederThread:
Traceback (most recent call last):
  File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\queues.py", line 150, in _feed
    obj_ = dumps(obj, reducers=reducers)
  File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\reduction.py", line 243, in dumps
    dump(obj, buf, reducers=reducers, protocol=protocol)
  File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\reduction.py", line 236, in dump
    _LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
  File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 267, in dump
    return Pickler.dump(self, obj)
  File "C:\Python35\lib\pickle.py", line 408, in dump
    self.save(obj)
  File "C:\Python35\lib\pickle.py", line 520, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Python35\lib\pickle.py", line 623, in save_reduce
    save(state)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 810, in save_dict
    self._batch_setitems(obj.items())
  File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
    save(v)
  File "C:\Python35\lib\pickle.py", line 520, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Python35\lib\pickle.py", line 623, in save_reduce
    save(state)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 810, in save_dict
    self._batch_setitems(obj.items())
  File "C:\Python35\lib\pickle.py", line 841, in _batch_setitems
    save(v)
  File "C:\Python35\lib\pickle.py", line 520, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Python35\lib\pickle.py", line 623, in save_reduce
    save(state)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 810, in save_dict
    self._batch_setitems(obj.items())
  File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
    save(v)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 770, in save_list
    self._batch_appends(obj)
  File "C:\Python35\lib\pickle.py", line 797, in _batch_appends
    save(tmp[0])
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 725, in save_tuple
    save(element)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 718, in save_instancemethod
    self.save_reduce(types.MethodType, (obj.__func__, obj.__self__), obj=obj)
  File "C:\Python35\lib\pickle.py", line 599, in save_reduce
    save(args)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 725, in save_tuple
    save(element)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 395, in save_function
    self.save_function_tuple(obj)
  File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 594, in save_function_tuple
    save(state)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 810, in save_dict
    self._batch_setitems(obj.items())
  File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
    save(v)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 810, in save_dict
    self._batch_setitems(obj.items())
  File "C:\Python35\lib\pickle.py", line 841, in _batch_setitems
    save(v)
  File "C:\Python35\lib\pickle.py", line 520, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Python35\lib\pickle.py", line 623, in save_reduce
    save(state)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 810, in save_dict
    self._batch_setitems(obj.items())
  File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
    save(v)
  File "C:\Python35\lib\pickle.py", line 520, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Python35\lib\pickle.py", line 599, in save_reduce
    save(args)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 740, in save_tuple
    save(element)
  File "C:\Python35\lib\pickle.py", line 520, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Python35\lib\pickle.py", line 623, in save_reduce
    save(state)
  File "C:\Python35\lib\pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python35\lib\pickle.py", line 740, in save_tuple
    save(element)
  File "C:\Python35\lib\pickle.py", line 495, in save
    rv = reduce(self.proto)
  File "stringsource", line 2, in preshed.maps.PreshMap.__reduce_cython__
TypeError: self.c_map cannot be converted to a Python object for pickling

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python35\lib\threading.py", line 914, in _bootstrap_inner
    self.run()
  File "C:\Python35\lib\threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\queues.py", line 175, in _feed
    onerror(e, obj)
  File "C:\Python35\lib\site-packages\joblib\externals\loky\process_executor.py", line 310, in _on_queue_feeder_error
    self.thread_wakeup.wakeup()
  File "C:\Python35\lib\site-packages\joblib\externals\loky\process_executor.py", line 155, in wakeup
    self._writer.send_bytes(b"")
  File "C:\Python35\lib\multiprocessing\connection.py", line 183, in send_bytes
    self._check_closed()
  File "C:\Python35\lib\multiprocessing\connection.py", line 136, in _check_closed
    raise OSError("handle is closed")
OSError: handle is closed



The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File ".../playground.py", line 43, in <module>
    print(obj.Paralize(sentences))
  File ".../playground.py", line 32, in Paralize
    results = Parallel(n_jobs=2)(delayed(self.get_nouns)(sent) for sent in tqdm(sentences))
  File "C:\Python35\lib\site-packages\joblib\parallel.py", line 934, in __call__
    self.retrieve()
  File "C:\Python35\lib\site-packages\joblib\parallel.py", line 833, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "C:\Python35\lib\site-packages\joblib\_parallel_backends.py", line 521, in wrap_future_result
    return future.result(timeout=timeout)
  File "C:\Python35\lib\concurrent\futures\_base.py", line 405, in result
    return self.__get_result()
  File "C:\Python35\lib\concurrent\futures\_base.py", line 357, in __get_result
    raise self._exception
_pickle.PicklingError: Could not pickle the task to send it to the workers.

What is the problem in my code?


Solution 1:

Q: What is the problem in my code?

Most probably the issue comes not from your code but from the "hidden" processing that kicks in once n_jobs is greater than 1: joblib then sets up that many worker processes, each needing its own copy of the objects the task uses, so that they can run independently of one another (thus escaping the GIL and mapping the separate process flows onto physical hardware resources).

This step is responsible for copying the Python objects, and it relies on pickling to do so (the traceback shows joblib's bundled cloudpickle sitting on top of the standard pickle module). pickle has long-standing limits on what it can and cannot serialize.

The error message confirms this:

TypeError: self.c_map cannot be converted to a Python object for pickling
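
To see the same failure mode in isolation, here is a minimal sketch that uses only the standard library. Holder and try_pickle are hypothetical names, and a threading.Lock merely stands in for spaCy's Cython-backed PreshMap. The point is that pickling a bound method also pickles the instance behind it, which is what the save_instancemethod frame in the traceback suggests is happening to the task.

```python
import pickle
import threading


class Holder:
    """Hypothetical object wrapping a resource that cannot be pickled."""

    def __init__(self):
        # Stand-in for an unpicklable attribute such as a Cython struct.
        self.lock = threading.Lock()

    def work(self, x):
        return x * 2


def try_pickle(obj):
    """Return None if obj pickles, otherwise the error message."""
    try:
        pickle.dumps(obj)
        return None
    except (TypeError, pickle.PicklingError) as exc:
        return str(exc)


holder = Holder()

# Pickling the bound method drags `holder` (and its lock) along with it.
error = try_pickle(holder.work)
print("bound method:", error)

# Plain data pickles without complaint.
print("plain list:", try_pickle([1, 2, 3]))
```

If the task handed to the workers references no such object, there is nothing for the pickler to choke on.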

One trick to try is Mike McKerns's dill module in place of pickle, to test whether your "problematic" Python objects can be pickled with it without raising this error.

dill exposes the same API as pickle, so a plain import dill as pickle may help while leaving the rest of the code unchanged.
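
As a rough illustration of the drop-in idea, the sketch below round-trips a lambda, something the standard pickle module refuses to serialize. It assumes dill is installed (pip install dill); square is a hypothetical example object. It does not verify that dill can serialize spaCy's Cython-backed objects, so test that separately against your own model.

```python
import pickle

import dill  # third-party package, assumed to be installed

# A lambda: the standard pickle module cannot serialize it.
square = lambda x: x * x

try:
    pickle.dumps(square)
    stdlib_ok = True
except (pickle.PicklingError, AttributeError):
    stdlib_ok = False

# dill offers the same dumps/loads calls and serializes the function itself.
restored = dill.loads(dill.dumps(square))

print("standard pickle succeeded:", stdlib_ok)
print("restored(7) =", restored(7))
```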

I had the same problem with large models that had to be distributed to, and collected back from, multiple processes, and dill was the way to go. Performance also improved.

Bonus: dill can save and restore the full Python interpreter state!

This was a nice side effect of finding dill: once import dill as pickle is done, pickle.dump_session( <aFile> ) saves a complete, stateful copy of the Python interpreter session. It can be restored when needed (post-crash restores, trained and optimised ML models saved and restored with their state, incrementally learning ML models saved with their state and redistributed for remote restores across the deployed user base, etc.).

Solution 2:

Same issue. I solved it by changing the backend from loky to threading in Parallel.
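
A minimal sketch of that change, assuming joblib is installed. UnpicklableTagger and long_words are hypothetical stand-ins for the loaded spaCy pipeline and the noun extraction, so the snippet runs without a model. With the threading backend the workers share the parent's memory, so the task is never pickled.

```python
import threading

from joblib import Parallel, delayed


class UnpicklableTagger:
    """Hypothetical stand-in for a loaded spaCy pipeline."""

    def __init__(self):
        # A lock cannot be pickled, so this object would break a
        # process-based backend the same way the real model does.
        self._lock = threading.Lock()

    def long_words(self, text):
        # Placeholder for real noun extraction: keep words over 4 characters.
        return [word for word in text.split() if len(word) > 4]


tagger = UnpicklableTagger()
sentences = [
    "we went to the school yesterday",
    "The weather is really cold",
]

# backend="threading" runs the tasks in threads of the current process.
results = Parallel(n_jobs=2, backend="threading")(
    delayed(tagger.long_words)(sentence) for sentence in sentences
)
print(results)
```

Keep in mind that threads share the GIL, so CPU-bound work speeds up only to the extent that the underlying library releases it.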