Meaning of inter_op_parallelism_threads and intra_op_parallelism_threads
Solution 1:
The inter_op_parallelism_threads and intra_op_parallelism_threads options are documented in the source of the tf.ConfigProto protocol buffer. These options configure two thread pools used by TensorFlow to parallelize execution, as the comments describe:
// The execution of an individual op (for some op types) can be
// parallelized on a pool of intra_op_parallelism_threads.
// 0 means the system picks an appropriate number.
int32 intra_op_parallelism_threads = 2;
// Nodes that perform blocking operations are enqueued on a pool of
// inter_op_parallelism_threads available in each process.
//
// 0 means the system picks an appropriate number.
//
// Note that the first Session created in the process sets the
// number of threads for all future sessions unless use_per_session_threads is
// true or session_inter_op_thread_pool is configured.
int32 inter_op_parallelism_threads = 5;
There are several possible forms of parallelism when running a TensorFlow graph, and these options provide some control over multi-core CPU parallelism:
If you have an operation that can be parallelized internally, such as matrix multiplication (tf.matmul()) or a reduction (e.g. tf.reduce_sum()), TensorFlow will execute it by scheduling tasks in a thread pool with intra_op_parallelism_threads threads. This configuration option therefore controls the maximum parallel speedup for a single operation. Note that if you run multiple operations in parallel, these operations will share this thread pool.
If you have many operations that are independent in your TensorFlow graph (because there is no directed path between them in the dataflow graph), TensorFlow will attempt to run them concurrently, using a thread pool with inter_op_parallelism_threads threads. If those operations have a multithreaded implementation, they will (in most cases) share the same thread pool for intra-op parallelism.
Finally, both configuration options take a default value of 0, which means "the system picks an appropriate number." Currently, this means that each thread pool will have one thread per CPU core in your machine.
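To make the two-pool picture concrete, here is a toy model written with Python's standard library only. It is not TensorFlow's implementation: the names resolve_threads, intra_op_pool, inter_op_pool and parallel_sum are hypothetical stand-ins, and the chunking scheme is an assumption chosen just for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_threads(requested):
    """Mimic the documented default: 0 means 'let the system pick'."""
    return requested if requested > 0 else (os.cpu_count() or 1)


# Hypothetical stand-ins for the two TensorFlow thread pools.
intra_op_pool = ThreadPoolExecutor(max_workers=resolve_threads(0))
inter_op_pool = ThreadPoolExecutor(max_workers=resolve_threads(2))


def parallel_sum(values, chunks=4):
    """One 'op' that splits its own work across the shared intra-op pool."""
    size = max(1, len(values) // chunks)
    parts = [values[i:i + size] for i in range(0, len(values), size)]
    return sum(intra_op_pool.map(sum, parts))


# Two independent 'ops' (no data dependency) scheduled on the inter-op pool.
future_a = inter_op_pool.submit(parallel_sum, list(range(1000)))
future_b = inter_op_pool.submit(parallel_sum, list(range(1000, 2000)))
total = future_a.result() + future_b.result()
print(total)  # 1999000
```

Both "ops" run concurrently on the inter-op pool, and each one hands its chunks to the same intra-op pool, which mirrors the sharing described above.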
Solution 2:
To get the best performance from a machine, set the parallelism threads and OpenMP settings for the TensorFlow backend as shown below (from here):
import tensorflow as tf  # TensorFlow 1.x API

# NUM_PARALLEL_EXEC_UNITS is the number of cores per socket in the machine.
# The value 4 is only a placeholder assumption; replace it with your own count.
# When NUM_PARALLEL_EXEC_UNITS=0 the system chooses appropriate settings.
NUM_PARALLEL_EXEC_UNITS = 4

config = tf.ConfigProto(intra_op_parallelism_threads=NUM_PARALLEL_EXEC_UNITS,
                        inter_op_parallelism_threads=2,
                        allow_soft_placement=True,
                        device_count={'CPU': NUM_PARALLEL_EXEC_UNITS})
session = tf.Session(config=config)
Answer to the comment below: [source]
allow_soft_placement=True
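If you would like TensorFlow to automatically choose an existing and supported device to run the operations in case the specified one doesn't exist, you can set allow_soft_placement to True in the configuration option when creating the session. In simple words, it lets TensorFlow fall back to an available device (for example, the CPU) instead of raising an error. It does not control GPU memory allocation, which is governed by the separate gpu_options.allow_growth setting.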
Solution 3:
TensorFlow 2.0 compatible answer: If we want to execute in Graph Mode of TensorFlow 2.0, the configuration object in which we can set inter_op_parallelism_threads and intra_op_parallelism_threads is tf.compat.v1.ConfigProto.
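A minimal sketch of what that might look like, assuming TensorFlow 2.x is installed. The thread counts (4 and 2) and the small matmul graph are illustrative assumptions, not recommended settings.

```python
import tensorflow as tf

# Illustrative values (assumptions); 0 would let TensorFlow choose.
INTRA_OP_THREADS = 4
INTER_OP_THREADS = 2

config = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=INTRA_OP_THREADS,
    inter_op_parallelism_threads=INTER_OP_THREADS,
)

# Build a small graph explicitly so it runs in TF1-style graph mode.
graph = tf.Graph()
with graph.as_default():
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    product = tf.matmul(a, a)

# Run the graph in a session that uses the thread-pool configuration.
with tf.compat.v1.Session(graph=graph, config=config) as sess:
    result = sess.run(product)
```

For eager-mode code, TensorFlow 2.x also provides tf.config.threading.set_intra_op_parallelism_threads() and tf.config.threading.set_inter_op_parallelism_threads(); these need to be called before the runtime has been initialized.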