What is the difference between cuda vs tensor cores?

Now only Tesla V100 and Titan V have tensor cores. Both GPUs have 5120 cuda cores where each core can perform up to 1 single precision multiply-accumulate operation (e.g. in fp32: x += y * z) per 1 GPU clock (e.g. Tesla V100 PCIe frequency is 1.38Gz).

Each tensor core perform operations on small matrices with size 4x4. Each tensor core can perform 1 matrix multiply-accumulate operation per 1 GPU clock. It multiplies two fp16 matrices 4x4 and adds the multiplication product fp32 matrix (size: 4x4) to accumulator (that is also fp32 4x4 matrix).

It is called mixed precision because input matrices are fp16 but multiplication result and accumulator are fp32 matrices.

Probably, the proper name would be just 4x4 matrix cores however NVIDIA marketing team decided to use "tensor cores".


GPU’s have always been good for machine learning. GPU cores were originally designed for physics and graphics computation, which involves matrix operations. General computing tasks do not require lots of matrix operations, so CPU’s are much slower at these. Physics and graphics are also far easier to parallelise than general computing tasks, leading to the high core count.

Due to the matrix heavy nature of machine learning (neural nets), GPU’s were a great fit. Tensor cores are just more heavily specialised to the types of computation involved in machine learning software (such as Tensorflow).

Nvidia have written a detailed blog here, which goes into far more detail on how Tensor cores work and the preformance improvements over CUDA cores.


CUDA cores:

Does a single value multiplication per one GPU clock

1 x 1 per GPU clock

TENSOR cores:

Does a matrix multiplication per one GPU clock

[1 1 1       [1 1 1
 1 1 1   x    1 1 1    per GPU clock
 1 1 1]       1 1 1]

To be more precise TENSOR core does the computation of many CUDA cores in the same time.


Tensor cores use a lot less computation power at the expense of precision than Cuda Cores, but that loss of precision doesn't have that much effect on the final output.

This is why for Machine Learning models, Tensor Cores are more effective at cost reduction without changing the output that much.

Google itself uses the Tensor Processing Units for google translate.