How and when should I use pitched pointer with the cuda API?

I have quite a good understanding about how to allocate and copy linear memory with cudaMalloc() and cudaMemcpy(). However, when I want to use the CUDA functions to allocate and copy 2D or 3D matrices, I am often befuddled by the various arguments, especially concerning pitched pointers which are always present when dealing with 2D/3D arrays. The documentation is good for providing a couple examples on how to use them but it assumes that I am familiar with the notion of padding and pitch, which I am not.

I usually end up tweaking the various examples I find in the documentation or somewhere else on the web, but the blind debugging that follows is quite painful, so my question is:

What is a pitch? How do I use it? How do I allocate and copy 2D and 3D arrays in CUDA?


Solution 1:

Here is an explanation about pitched pointer and padding in CUDA.

Linear memory vs padded memory

First, lets start with the reason for the existence of non linear memory. When allocating memory with cudaMalloc, the result is like an allocation with malloc, we have a contiguous memory chunk of the size specified and we can put anything we want in it. If we want to allocate a vector of 10000 float, we simply do:

float* myVector;
cudaMalloc(&myVector, 10000*sizeof(float));

and then access ith element of myVector by classic indexing:

float element = myVector[i];

and if we want to access the next element, we just do:

float next_element = myvector[i+1];

It works very fine because accessing an element right next to the first one is (for reasons I am not aware of and I don't wish to be for now) cheap.

Things become a little bit different when we use our memory as a 2D array. Lets say our 10000 float vector is in fact a 100x100 array. We can allocate it by using the same cudaMalloc function, and if we want to read the i-th row, we do:

float* myArray;
cudaMalloc(&myArray, 10000*sizeof(float));
int row[100];  // number of columns
for (int j=0; j<100; ++j)
    row[j] = myArray[i*100+j];

Word alignment

So we have to read memory from myArray+100*i to myArray+101*i-1. The number of memory access operation it will take depends on the number of memory words this row takes. The number of bytes in a memory word depends on the implementation. To minimize the number of memory accesses when reading a single row, we must assure that we start the row on the start of a word, hence we must pad the memory for every row until the start of a new one.

Bank conflicts

Another reason for padding arrays is the bank mechanism in CUDA, concerning shared memory access. When the array is in the shared memory, it is split into several memory banks. Two CUDA threads can access it simultaneously, provided they don't access memory belonging to the same memory bank. Since we usually want to treat each row in parallel, we can ensure that we can access it simulateously by padding each row to the start of a new bank.

Now, instead of allocating the 2D array with cudaMalloc, we will use cudaMallocPitch:

size_t pitch;
float* myArray;
cudaMallocPitch(&myArray, &pitch, 100*sizeof(float), 100);  // width in bytes by height

Note that the pitch here is the return value of the function: cudaMallocPitch checks what it should be on your system and returns the appropriate value. What cudaMallocPitch does is the following:

  1. Allocate the first row.
  2. Check if the number of bytes allocated makes it correctly aligned. For example that it is a multiple of 128.
  3. If not, allocate further bytes to reach the next multiple of 128. the pitch is then the number of bytes allocated for a single row, including the extra bytes (padding bytes).
  4. Reiterate for each row.

At the end, we have typically allocated more memory than necessary because each row is now the size of pitch, and not the size of w*sizeof(float).

But now, when we want to access an element in a column, we must do:

float* row_start = (float*)((char*)myArray + row * pitch);
float column_element = row_start[column];

The offset in bytes between two successive columns can no more be deduced from the size of our array, that is why we want to keep the pitch returned by cudaMallocPitch. And since pitch is a multiple of the padding size (typically, the biggest of word size and bank size), it works great. Yay.

Copying data to/from pitched memory

Now that we know how to create and access a single element in an array created by cudaMallocPitch, we might want to copy whole part of it to and from other memory, linear or not.

Lets say we want to copy our array in a 100x100 array allocated on our host with malloc:

float* host_memory = (float*)malloc(100*100*sizeof(float));

If we use cudaMemcpy, we will copy all the memory allocated with cudaMallocPitch, including the padded bytes between each rows. What we must do to avoid padding memory is copying each row one by one. We can do it manually:

for (size_t i=0; i<100; ++i) {
  cudaMemcpy(host_memory[i*100], myArray[pitch*i],
             100*sizeof(float), cudaMemcpyDeviceToHost);
}

Or we can tell the CUDA API that we want only the useful memory from the memory we allocated with padding bytes for its convenience so if it could deal with its own mess automatically it would be very nice indeed, thank you. And here enters cudaMemcpy2D:

cudaMemcpy2D(host_memory, 100*sizeof(float)/*no pitch on host*/,
             myArray, pitch/*CUDA pitch*/,
             100*sizeof(float)/*width in bytes*/, 100/*heigth*/, 
             cudaMemcpyDeviceToHost);

Now the copy will be done automatically. It will copy the number of bytes specified in width (here: 100xsizeof(float)), height time (here: 100), skipping pitch bytes every time it jumps to a next row. Note that we must still provide the pitch for the destination memory because it could be padded, too. Here it is not, so the pitch is equal to the pitch of a non-padded array: it is the size of a row. Note also that the width parameter in the memcpy function is expressed in bytes, but the height parameter is expressed in number of elements. That is because of the way the copy is done, someway like I wrote the manual copy above: the width is the size of each copy along a row (elements that are contiguous in memory) and the height is the number of times this operation must be accomplished. (These inconsistencies in units, as a physicist, annoys me very much.)

Dealing with 3D arrays

3D arrays are no different that 2D arrays actually, there is no additional padding included. A 3D array is just a 2D classical array of padded rows. That is why when allocating a 3D array, you only get one pitch that is the difference in bytes count between to successive points along a row. If you want to access to successive points along the depth dimension, you can safely multiply the pitch by the number of columns, which gives you the slicePitch.

The CUDA API for accessing 3D memory is slightly different than the one for 2D memory, but the idea is the same :

  • When using cudaMalloc3D, you receive a pitch value that you must carefully keep for subsequent access to the memory.
  • When copying a 3D memory chunk, you cannot use cudaMemcpy unless you are copying a single row. You must use any other kind of copy utility provided by the CUDA utility that takes the pitch into account.
  • When you copy your data to/from linear memory, you must provide a pitch to your pointer even though it is irrelevant: this pitch is the size of a row, expressed in bytes.
  • The size parameters are expressed in bytes for the row size, and in number of elements for the column and depth dimension.