CUDA: How to use -arch and -code and SM vs COMPUTE

Some related questions/answers are here and here.

I am still not sure how to properly specify the architectures for code generation when building with nvcc.

A complete description is somewhat complicated, but there are a few relatively simple, easy-to-remember canonical usages. Compile for the architecture (both virtual and real) that matches the GPUs you wish to target. A fairly simple form is:

-gencode arch=compute_XX,code=sm_XX

where XX is the two-digit compute capability of the GPU you wish to target. If you wish to target multiple GPUs, simply repeat the entire sequence for each XX target. This is approximately the approach taken with the CUDA sample code projects. (If you'd like to include PTX in your executable, add another -gencode whose code option names the same PTX virtual architecture as its arch option, i.e. -gencode arch=compute_XX,code=compute_XX.)
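As a rough sketch of the "repeat per target" pattern, the shell helper below builds the flag list for several compute capabilities and adds a PTX entry for the last one. The function name gencode_flags and the file names app.cu and app are made up for illustration, and treating the last argument as the newest target is an assumption, not something nvcc requires.

```shell
#!/bin/sh
# gencode_flags is a hypothetical helper, not part of nvcc.
# It emits one SASS-producing -gencode per compute capability given,
# plus one PTX-producing -gencode for the last capability in the list
# (assumed here to be the newest target).
gencode_flags() {
    flags=""
    last=""
    for cc in "$@"; do
        flags="$flags -gencode arch=compute_${cc},code=sm_${cc}"
        last="$cc"
    done
    if [ -n "$last" ]; then
        flags="$flags -gencode arch=compute_${last},code=compute_${last}"
    fi
    # Strip the single leading space before printing.
    echo "${flags# }"
}

# Print (not run) the command line for cc 5.2 and cc 6.1 targets.
# app.cu and app are placeholder names.
echo "nvcc $(gencode_flags 52 61) -o app app.cu"
```

Printing the command rather than running it makes it easy to check the flags before a real build.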

Another fairly simple form, when targeting only a single GPU, is just to use:

-arch=sm_XX

with the same description for XX. This form will include both SASS and PTX for the specified architecture.
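To make the shorthand concrete, here is a small sketch that rewrites -arch=sm_XX into the longer -gencode spelling that, as I understand the nvcc documentation, embeds the same two things (SASS for sm_XX and PTX for compute_XX). The helper name expand_arch is hypothetical; nvcc performs this expansion internally and has no such command.

```shell
#!/bin/sh
# expand_arch is a hypothetical helper, not an nvcc feature.
# It rewrites the -arch=sm_XX shorthand as two -gencode clauses:
# one producing SASS for sm_XX, one embedding PTX for compute_XX.
# Any other argument is passed through unchanged.
expand_arch() {
    case "$1" in
        -arch=sm_*)
            cc="${1#-arch=sm_}"
            echo "-gencode arch=compute_${cc},code=sm_${cc} -gencode arch=compute_${cc},code=compute_${cc}"
            ;;
        *)
            echo "$1"
            ;;
    esac
}

expand_arch -arch=sm_52
```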

Now, according to this, apart from the two compiler flags there are also two ways of specifying architectures: sm_XX and compute_XX, where compute_XX refers to a virtual architecture and sm_XX to a real one. The -arch flag only takes identifiers for virtual architectures (such as compute_XX), whereas the -code flag takes identifiers for both real and virtual architectures.

That is basically correct when arch and code are used as sub-switches within the -gencode switch, or when both are used together as standalone switches, as you describe. But when -arch is used by itself (without -code), it acts as another kind of "shorthand" notation, and in that case you can pass a real architecture, for example -arch=sm_52.

However, it is not clear which PTX or binary code will be embedded in the binary. If I specify for example -arch=compute_30 -code=sm_52, does that mean my code will first be compiled to feature level 3.0 PTX, from which machine code for feature level 5.2 will afterwards be created? And what will be embedded?

The exact definition of what gets embedded varies depending on the form of the usage. But for this example:

-gencode arch=compute_30,code=sm_52

or for the equivalent case you identify:

-arch=compute_30 -code=sm_52

then yes, it means that:

1. A temporary PTX code will be generated from your source code, and it will use cc3.0 PTX.
2. From that PTX, the ptxas tool will generate cc5.2-compliant SASS code.
3. The SASS code will be embedded in your executable.
4. The PTX code will be discarded.

(I'm not sure why you would actually specify such a combo, but it is legal.)

If I just specify -code=sm_52, what will happen then? Will only machine code for V5.2, created out of V5.2 PTX code, be embedded? And what would be the difference from -code=compute_52?

-code=sm_52 will generate cc5.2 SASS code out of an intermediate PTX code. The SASS code will be embedded, the PTX will be discarded. Note that specifying this option by itself in this form, with no -arch option, would be illegal. (1)

-code=compute_52 will generate cc5.x PTX code (only) and embed that PTX in the executable/binary. Note that specifying this option by itself in this form, with no -arch option, would be illegal. (1)

The cuobjdump tool can be used to identify what components exactly are in a given binary.
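For example, a minimal sketch of such a check might look like the following. The helper name inspect_fatbin is made up for illustration; the --list-elf and --list-ptx options are the cuobjdump switches for listing embedded SASS (ELF) and PTX images respectively.

```shell
#!/bin/sh
# inspect_fatbin is a hypothetical helper; pass it the path to a binary
# built with nvcc. It lists which SASS (ELF) and PTX images are embedded.
inspect_fatbin() {
    bin="$1"
    if ! command -v cuobjdump >/dev/null 2>&1; then
        echo "cuobjdump not found on PATH (it ships with the CUDA toolkit)" >&2
        return 127
    fi
    echo "== embedded SASS (ELF) images =="
    cuobjdump --list-elf "$bin"
    echo "== embedded PTX images =="
    cuobjdump --list-ptx "$bin"
}

# Usage (./app is a placeholder binary name):
#   inspect_fatbin ./app
```

For a binary built with -gencode arch=compute_30,code=sm_52 you would expect an sm_52 entry in the ELF list and nothing in the PTX list, matching the description above.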

(1) When no -gencode switch is used, and no -arch switch is used, nvcc assumes a default -arch=sm_20 is appended to your compile command (this is for CUDA 7.5; the default -arch setting may vary by CUDA version). sm_20 is a real architecture, and it is not legal to specify a real architecture on the -arch option when a -code option is also supplied. So a bare -code=sm_52 is treated as -arch=sm_20 -code=sm_52, which is rejected.
