Why do these data not change after being processed by the GPU?

Make sure to always do error checking after each Runtime API call. You can use the code below (not originally mine):

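#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Pass true to synchronize with the device first (e.g. after a kernel launch).
#define gpuErrchk(DS) { gpuAssert((DS), __FILE__, __LINE__); }
inline void gpuAssert(bool deviceSync, const char *file, int line, bool abort = true)
{
    // Wait for pending device work so asynchronous kernel errors become visible.
    if (deviceSync)
        cudaDeviceSynchronize();
    // Fetch the most recent error recorded by the runtime.
    cudaError_t code = cudaGetLastError();
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}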

A cudaDeviceSynchronize() is not always necessary when checking for errors, but it is good practice after kernel launches, because launches are asynchronous and execution errors only show up once the device has finished. If you call gpuErrchk() after each step of your code, you'll see that the first cudaMemcpy fails with "invalid argument". That is because gpu_array is declared as u32 gpu_array[NUM_ELEM], which is an ordinary array in host memory; you must change it to u32 *gpu_array, otherwise the memory will be allocated on the host, not the device. After this error occurs, the kernel won't be launched either, and that's why the output is equal to the input. Apart from that, I tested your code with arbitrary input and didn't get correct answers anyway.
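To illustrate the pointer fix, here is a minimal sketch of how the allocation and the checks might look. Your kernel isn't reproduced here, so doubleElements, runDoubleOnGpu, the u32 typedef and the value of NUM_ELEM are all assumptions made up for the example; only the u32 *gpu_array declaration and the gpuErrchk calls are the point.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

typedef unsigned int u32;  // assumption: u32 is a 32-bit unsigned integer
#define NUM_ELEM 8         // assumption: small illustrative size

// Same checker as in the answer, repeated so this file compiles on its own.
#define gpuErrchk(DS) { gpuAssert((DS), __FILE__, __LINE__); }
inline void gpuAssert(bool deviceSync, const char *file, int line, bool abort = true)
{
    if (deviceSync)
        cudaDeviceSynchronize();
    cudaError_t code = cudaGetLastError();
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

// Hypothetical stand-in for the real kernel: doubles every element.
__global__ void doubleElements(u32 *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2;
}

// Returns true if every element came back doubled.
bool runDoubleOnGpu()
{
    u32 cpu_array[NUM_ELEM];
    for (int i = 0; i < NUM_ELEM; ++i)
        cpu_array[i] = (u32)i;

    u32 *gpu_array = NULL;  // a device pointer, not u32 gpu_array[NUM_ELEM]
    cudaMalloc((void **)&gpu_array, NUM_ELEM * sizeof(u32));
    gpuErrchk(false);

    cudaMemcpy(gpu_array, cpu_array, NUM_ELEM * sizeof(u32), cudaMemcpyHostToDevice);
    gpuErrchk(false);

    doubleElements<<<1, NUM_ELEM>>>(gpu_array, NUM_ELEM);
    gpuErrchk(true);  // synchronize so kernel execution errors are reported here

    cudaMemcpy(cpu_array, gpu_array, NUM_ELEM * sizeof(u32), cudaMemcpyDeviceToHost);
    gpuErrchk(false);

    cudaFree(gpu_array);
    gpuErrchk(false);

    for (int i = 0; i < NUM_ELEM; ++i)
        if (cpu_array[i] != 2u * (u32)i)
            return false;
    return true;
}

int main()
{
    printf("%s\n", runDoubleOnGpu() ? "data changed as expected" : "data mismatch");
    return 0;
}
```

If you switch gpu_array back to a host array and pass it to cudaMemcpy with cudaMemcpyHostToDevice, the check right after that call should be the one that reports the failure, which is how you locate this kind of bug.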