CUDA: How to use -arch and -code and SM vs COMPUTE
Some related questions/answers are here and here.
I am still not sure how to properly specify the architectures for code generation when building with nvcc.
A complete description is somewhat complicated, but there are intended to be relatively simple, easy-to-remember canonical usages. Compile for the architecture (both virtual and real), that represents the GPUs you wish to target. A fairly simple form is:
-gencode arch=compute_XX,code=sm_XX
where XX is the two digit compute capability for the GPU you wish to target. If you wish to target multiple GPUs, simply repeat the entire sequence for each XX target. This is approximately the approach taken with the CUDA sample code projects. (If you'd like to include PTX in your executable, include an additional -gencode
with the code
option specifying the same PTX virtual architecture as the arch
option).
Another fairly simple form, when targetting only a single GPU, is just to use:
-arch=sm_XX
with the same description for XX. This form will include both SASS and PTX for the specified architecture.
Now, according to this apart from the two compiler flags there are also two ways of specifying architectures: sm_XX and compute_XX, where compute_XX refers to a virtual and sm_XX to a real architecture. The flag -arch only takes identifiers for virtual architectures (such as compute_XX) whereas the -code flag takes both, identifiers for real and for virtual architectures.
That is basically correct when arch
and code
are used as sub-switches within the -gencode
switch, or if both are used together, standalone as you describe. But, for example, when -arch
is used by itself (without -code
), it represents another kind of "shorthand" notation, and in that case, you can pass a real architecture, for example -arch=sm_52
However, it is not clear which PTX or binary code will be embedded in the binary. If I specify for example -arch=compute_30 -code=sm_52, does that mean my code will first be compiled to feature level 3.0 PTX from which afterwards machine code for feature level 5.2 will be created from? And what will be embedded?
The exact definition of what gets embedded varies depending on the form of the usage. But for this example:
-gencode arch=compute_30,code=sm_52
or for the equivalent case you identify:
-arch=compute_30 -code=sm_52
then yes, it means that:
- A temporary PTX code will be generated from your source code, and it will use cc3.0 PTX.
- From that PTX, the
ptxas
tool will generate cc5.2-compliant SASS code. - The SASS code will be embedded in your executable.
- The PTX code will be discarded.
(I'm not sure why you would actually specify such a combo, but it is legal.)
If I just specify -code=sm_52 what will happen then? Only machine code for V5.2 will be embedded that has been created out of V5.2 PTX code? And what would be the difference to -code=compute_52?
-code=sm_52
will generate cc5.2 SASS code out of an intermediate PTX code. The SASS code will be embedded, the PTX will be discarded. Note that specifying this option by itself in this form, with no -arch
option, would be illegal. (1)
-code=compute_52
will generate cc5.x PTX code (only) and embed that PTX in the executable/binary. Note that specifying this option by itself in this form, with no -arch
option, would be illegal. (1)
The cuobjdump
tool can be used to identify what components exactly are in a given binary.
(1) When no -gencode
switch is used, and no -arch
switch is used, nvcc
assumes a default -arch=sm_20
is appended to your compile command (this is for CUDA 7.5, the default -arch
setting may vary by CUDA version). sm_20
is a real architecture, and it is not legal to specify a real architecture on the -arch
option when a -code
option is also supplied.