Why is the max amount of shared memory per block on Ampere GPUs not a multiple of 16 KiB?
Quoting https://docs.nvidia.com/cuda/cuda-c-programming-guide/#shared-memory-8-x
Similar to the Volta architecture, the amount of the unified data cache reserved for shared memory is configurable on a per kernel basis. For the NVIDIA Ampere GPU architecture, the unified data cache has a size of 192 KB for devices of compute capability 8.0 and 128 KB for devices of compute capability 8.6. The shared memory capacity can be set to 0, 8, 16, 32, 64, 100, 132 or 164 KB for devices of compute capability 8.0, and to 0, 8, 16, 32, 64 or 100 KB for devices of compute capability 8.6.
Note that the maximum amount of shared memory per thread block is smaller than the maximum shared memory partition available per SM. The 1 KB of shared memory not made available to a thread block is reserved for system use.
So the total unified data cache is a multiple of 16 KiB (192 KB for compute capability 8.0, 128 KB for 8.6), but it is split between L1 cache and shared memory. On top of that, 1 KB of the shared memory carveout is reserved for system use, which is why the per-block maximum comes out to 163 KB on compute capability 8.0 and 99 KB on 8.6 rather than a multiple of 16 KiB.
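For completeness, the per-kernel configuration the guide refers to is done through the runtime API: you hint at the preferred carveout with cudaFuncAttributePreferredSharedMemoryCarveout, and opt in to more than the default 48 KB of dynamic shared memory per block with cudaFuncAttributeMaxDynamicSharedMemorySize. A minimal sketch (the kernel, grid/block sizes, and buffer sizes are just placeholders):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel that touches dynamic shared memory.
__global__ void myKernel(float *out) {
    extern __shared__ float smem[];
    smem[threadIdx.x] = static_cast<float>(threadIdx.x);
    __syncthreads();
    out[threadIdx.x] = smem[threadIdx.x];
}

int main() {
    int device = 0;
    cudaSetDevice(device);

    // Query the opt-in per-block maximum (163 KB on CC 8.0, 99 KB on CC 8.6).
    int maxOptin = 0;
    cudaDeviceGetAttribute(&maxOptin, cudaDevAttrMaxSharedMemoryPerBlockOptin, device);
    printf("Max opt-in shared memory per block: %d bytes\n", maxOptin);

    // Ask for the largest shared-memory carveout of the unified data cache
    // for this kernel (this is a hint, not a guarantee).
    cudaFuncSetAttribute(myKernel, cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);

    // Opt in to more than the default 48 KB of dynamic shared memory per block.
    cudaFuncSetAttribute(myKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, maxOptin);

    // Launch with the full opt-in amount as dynamic shared memory.
    float *out = nullptr;
    cudaMalloc(&out, 256 * sizeof(float));
    myKernel<<<1, 256, maxOptin>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```

On an A100 (compute capability 8.0) the query above reports 166912 bytes, i.e. 163 KB: the 164 KB carveout minus the 1 KB reserved for system use.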