Understanding the parameters of PTX instruction mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
How to understand the parameters in the following snippet of CUDA inline assembly code?
......
asm volatile( \
"mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 \n" \
" {%0, %1, %2, %3}, \n" \
" {%4, %5, %6, %7}, \n" \
" {%8, %9}, \n" \
" {%0, %1, %2, %3}; \n" \
: "+f"( elt(0)), "+f"( elt(1)), "+f"( elt(2)), "+f"( elt(3))
: "r"(a.reg(0)), "r"(a.reg(1)), "r"(a.reg(2)), "r"(a.reg(3))
, "r"(b.reg(0)), "r"(b.reg(1)));
......
The matrix multiplication and addition from D(16x8) = A(16x16) * B(16x8) + C(16*8). So why the parameter count of D:A:B:C is 4:4:2:4 but not 2:4:2:2?
Well, those matrix operations PTX instructions sure are quite a lot to wrap one's head around!
But - let's read the relevant section of the PTX ISA reference carefully:
- Multiplicand A, element type
.f16
: A vector expression containing four.f16
x2 registers. - Multiplicand B, element type
.f16
: A vector expression containing two.f16
x2 registers. - Addend C, element type
.f32
: A vector expression containing four.f32
registers. - Result D, element type
.f32
: A vector expression containing four.f32
registers.
A x B reduces the matrix size to that of B, which would be 2 rather than 4 in the vector expression size, and thus you might expect C to have a vector expression size of 2, h-o-w-e-v-e-r - you increase the element size to .f32
, and thus instead of each .f16
x2, you have a single .f32
, so you stay with a 4-element vector expression rather than 2 accounting for that fact.
So, 4:4:2:4 for D:A:B:C.