Parallelism isn't reducing the time in dataset map
TF Map function supports parallel calls. I'm seeing no improvements passing num_parallel_calls
to map. With num_parallel_calls=1
and num_parallel_calls=10
, there is no improvement in performance run time. Here is a simple code
import time
def test_two_custom_function_parallelism(num_parallel_calls=1, batch=False,
batch_size=1, repeat=1, num_iterations=10):
tf.reset_default_graph()
start = time.time()
dataset_x = tf.data.Dataset.range(1000).map(lambda x: tf.py_func(
squarer, [x], [tf.int64]),
num_parallel_calls=num_parallel_calls).repeat(repeat)
if batch:
dataset_x = dataset_x.batch(batch_size)
dataset_y = tf.data.Dataset.range(1000).map(lambda x: tf.py_func(
squarer, [x], [tf.int64]), num_parallel_calls=num_parallel_calls).repeat(repeat)
if batch:
dataset_y = dataset_x.batch(batch_size)
X = dataset_x.make_one_shot_iterator().get_next()
Y = dataset_x.make_one_shot_iterator().get_next()
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
i = 0
while True:
try:
res = sess.run([X, Y])
i += 1
if i == num_iterations:
break
except tf.errors.OutOfRangeError as e:
pass
Here are the timings
%timeit test_two_custom_function_parallelism(num_iterations=1000,
num_parallel_calls=2, batch_size=2, batch=True)
370ms
%timeit test_two_custom_function_parallelism(num_iterations=1000,
num_parallel_calls=5, batch_size=2, batch=True)
372ms
%timeit test_two_custom_function_parallelism(num_iterations=1000,
num_parallel_calls=10, batch_size=2, batch=True)
384ms
I used %timeit
in Juypter notebook. What am I doing it wrong?
Solution 1:
The problem here is that the only operation in the Dataset.map()
function is a tf.py_func()
op. This op calls back into the local Python interpreter to run a function in the same process. Increasing num_parallel_calls
will increase the number of TensorFlow threads that attempt to call back into Python concurrently. However, Python has something called the "Global Interpreter Lock" that prevents more than one thread from executing code at once. As a result, all but one of these multiple parallel calls will be blocked waiting to acquire the Global Interpreter Lock, and there will be almost no parallel speedup (and perhaps even a slight slowdown).
Your code example didn't include the definition of the squarer()
function, but it might be possible to replace tf.py_func()
with pure TensorFlow ops, which are implemented in C++, and can execute in parallel. For example—and just guessing by the name—you could replace it with an invocation of tf.square(x)
, and you might then enjoy some parallel speedup.
Note however that if the amount of work in the function is small, like squaring a single integer, the speedup might not be very large. Parallel Dataset.map()
is more useful for heavier operations, like parsing a TFRecord with tf.parse_single_example()
or performing some image distortions as part of a data augmentation pipeline.
Solution 2:
The reason maybe the squarer cost less time than overhead time. I modified the code with adding a quarter function which cost 2 seconds. Then the parameter num_parallel_calls works as expected. Here is the complete code:
import tensorflow as tf
import time
def squarer(x):
t0 = time.time()
while time.time() - t0 < 2:
y = x ** 2
return y
def test_two_custom_function_parallelism(num_parallel_calls=1,
batch=False,
batch_size=1,
repeat=1,
num_iterations=10):
tf.reset_default_graph()
start = time.time()
dataset_x = tf.data.Dataset.range(1000).map(
lambda x: tf.py_func(squarer, [x], [tf.int64]),
num_parallel_calls=num_parallel_calls).repeat(repeat)
# dataset_x = dataset_x.prefetch(4)
if batch:
dataset_x = dataset_x.batch(batch_size)
dataset_y = tf.data.Dataset.range(1000).map(
lambda x: tf.py_func(squarer, [x], [tf.int64]),
num_parallel_calls=num_parallel_calls).repeat(repeat)
# dataset_y = dataset_y.prefetch(4)
if batch:
dataset_y = dataset_x.batch(batch_size)
X = dataset_x.make_one_shot_iterator().get_next()
Y = dataset_x.make_one_shot_iterator().get_next()
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
i = 0
while True:
t0 = time.time()
try:
res = sess.run([X, Y])
print(res)
i += 1
if i == num_iterations:
break
except tf.errors.OutOfRangeError as e:
print(i)
break
print('step elapse: %.4f' % (time.time() - t0))
print('total time: %.4f' % (time.time() - start))
test_two_custom_function_parallelism(
num_iterations=4, num_parallel_calls=1, batch_size=2, batch=True, repeat=10)
test_two_custom_function_parallelism(
num_iterations=4, num_parallel_calls=10, batch_size=2, batch=True, repeat=10)
the output is:
[(array([0, 1]),), (array([0, 1]),)]
step elapse: 4.0204
[(array([4, 9]),), (array([4, 9]),)]
step elapse: 4.0836
[(array([16, 25]),), (array([16, 25]),)]
step elapse: 4.1529
[(array([36, 49]),), (array([36, 49]),)]
total time: 16.3374
[(array([0, 1]),), (array([0, 1]),)]
step elapse: 2.2139
[(array([4, 9]),), (array([4, 9]),)]
step elapse: 0.0585
[(array([16, 25]),), (array([16, 25]),)]
step elapse: 0.0469
[(array([36, 49]),), (array([36, 49]),)]
total time: 2.5317
So I am confused with the effect of "Global Interpreter Lock" mentioned by @mrry.
Solution 3:
I setup my own version of map
to get something similar to the TensorFlow's Dataset.map
, but which will use multiple CPUs for py_function
s.
Usage
Instead of
mapped_dataset = my_dataset.map(lambda x: tf.py_function(my_function, [x], [tf.float64]), num_parallel_calls=16)
with the below code, you can get a CPU parallel py_function
version using
mapped_dataset = map_py_function_to_dataset(my_dataset, my_function, number_of_parallel_calls=16)
(The output type(s) for the py_function can also be specified if it's not a single tf.float32
)
Internally, this creates a pool of multiprocessing
workers. It still uses the single regular GIL limited TensorFlow map
, but only to pass the input to a worker and get the output back. The workers processing the data happen in parallel on the CPU.
Caveats
The function passed needs to be picklable to work with the multiprocessing
pool. This should work for most cases, but some closures or whatnot may fail. Packages like dill
might loosen this restriction, but I haven't looked into that.
If you pass an object's method as the function, you also need to be careful about how the object is duplicated across processes (each process will have its own copy of the object, so you can't rely on the attributes being shared).
As long as these considerations are kept in mind, this code should work for many cases.
Code
"""
Code for TensorFlow's `Dataset` class which allows for multiprocessing in CPU map functions.
"""
import multiprocessing
from typing import Callable, Union, List
import signal
import tensorflow as tf
class PyMapper:
"""
A class which allows for mapping a py_function to a TensorFlow dataset in parallel on CPU.
"""
def __init__(self, map_function: Callable, number_of_parallel_calls: int):
self.map_function = map_function
self.number_of_parallel_calls = number_of_parallel_calls
self.pool = multiprocessing.Pool(self.number_of_parallel_calls, self.pool_worker_initializer)
@staticmethod
def pool_worker_initializer():
"""
Used to initialize each worker process.
"""
# Corrects bug where worker instances catch and throw away keyboard interrupts.
signal.signal(signal.SIGINT, signal.SIG_IGN)
def send_to_map_pool(self, element_tensor):
"""
Sends the tensor element to the pool for processing.
:param element_tensor: The element to be processed by the pool.
:return: The output of the map function on the element.
"""
result = self.pool.apply_async(self.map_function, (element_tensor,))
mapped_element = result.get()
return mapped_element
def map_to_dataset(self, dataset: tf.data.Dataset,
output_types: Union[List[tf.dtypes.DType], tf.dtypes.DType] = tf.float32):
"""
Maps the map function to the passed dataset.
:param dataset: The dataset to apply the map function to.
:param output_types: The TensorFlow output types of the function to convert to.
:return: The mapped dataset.
"""
def map_py_function(*args):
"""A py_function wrapper for the map function."""
return tf.py_function(self.send_to_map_pool, args, output_types)
return dataset.map(map_py_function, self.number_of_parallel_calls)
def map_py_function_to_dataset(dataset: tf.data.Dataset, map_function: Callable, number_of_parallel_calls: int,
output_types: Union[List[tf.dtypes.DType], tf.dtypes.DType] = tf.float32
) -> tf.data.Dataset:
"""
A one line wrapper to allow mapping a parallel py function to a dataset.
:param dataset: The dataset whose elements the mapping function will be applied to.
:param map_function: The function to map to the dataset.
:param number_of_parallel_calls: The number of parallel calls of the mapping function.
:param output_types: The TensorFlow output types of the function to convert to.
:return: The mapped dataset.
"""
py_mapper = PyMapper(map_function=map_function, number_of_parallel_calls=number_of_parallel_calls)
mapped_dataset = py_mapper.map_to_dataset(dataset=dataset, output_types=output_types)
return mapped_dataset