Understanding the parameters of PTX instruction mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32

How to understand the parameters in the following snippet of CUDA inline assembly code?

......
        asm volatile( \
            "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 \n" \
            "    {%0, %1, %2, %3}, \n" \
            "    {%4, %5, %6, %7}, \n" \
            "    {%8, %9}, \n" \
            "    {%0, %1, %2, %3}; \n" \
                    : "+f"(  elt(0)), "+f"(  elt(1)), "+f"(  elt(2)), "+f"(  elt(3))
                    :  "r"(a.reg(0)),  "r"(a.reg(1)),  "r"(a.reg(2)),  "r"(a.reg(3))
                    ,  "r"(b.reg(0)),  "r"(b.reg(1)));
......

The matrix multiplication and addition from D(16x8) = A(16x16) * B(16x8) + C(16*8). So why the parameter count of D:A:B:C is 4:4:2:4 but not 2:4:2:2?


Well, those matrix operations PTX instructions sure are quite a lot to wrap one's head around!

But - let's read the relevant section of the PTX ISA reference carefully:

  • Multiplicand A, element type .f16 : A vector expression containing four .f16x2 registers.
  • Multiplicand B, element type .f16 : A vector expression containing two .f16x2 registers.
  • Addend C, element type .f32 : A vector expression containing four .f32 registers.
  • Result D, element type .f32 : A vector expression containing four .f32 registers.

A x B reduces the matrix size to that of B, which would be 2 rather than 4 in the vector expression size, and thus you might expect C to have a vector expression size of 2, h-o-w-e-v-e-r - you increase the element size to .f32, and thus instead of each .f16x2, you have a single .f32, so you stay with a 4-element vector expression rather than 2 accounting for that fact.

So, 4:4:2:4 for D:A:B:C.