Coding CUDA with C#?

I've been looking for some information on coding CUDA (the nvidia gpu language) with C#. I have seen a few of the libraries, but it seems that they would add a bit of overhead (because of the p/invokes, etc).

  • How should I go about using CUDA in my C# applications? Would it be better to code it in say C++ and compile that into a dll?
  • Would this overhead of using a wrapper kill any advantages I would get from using CUDA?
  • And are there any good examples of using CUDA with C#?

Solution 1:

There is such a nice complete cuda 4.2 wrapper as ManagedCuda. You simply add C++ cuda project to your solution, which contains yours c# project, then you just add

call "%VS100COMNTOOLS%vsvars32.bat"
for /f %%a IN ('dir /b "$(ProjectDir)Kernels\*.cu"') do nvcc -ptx -arch sm_21 -m 64 -o "$(ProjectDir)bin\Debug\%%~na_64.ptx" "$(ProjectDir)Kernels\%%~na.cu"
for /f %%a IN ('dir /b "$(ProjectDir)Kernels\*.cu"') do nvcc -ptx -arch sm_21 -m 32 -o "$(ProjectDir)bin\Debug\%%~na.ptx" "$(ProjectDir)Kernels\%%~na.cu"

to post-build events in your c# project properties, this compiles *.ptx file and copies it in your c# project output directory.

Then you need simply create new context, load module from file, load function and work with device.

//NewContext creation
CudaContext cntxt = new  CudaContext();

//Module loading from precompiled .ptx in a project output folder
CUmodule cumodule = cntxt.LoadModule("kernel.ptx");

//_Z9addKernelPf - function name, can be found in *.ptx file
CudaKernel addWithCuda = new CudaKernel("_Z9addKernelPf", cumodule, cntxt);

//Create device array for data
CudaDeviceVariable<cData2> vec1_device = new CudaDeviceVariable<cData2>(num);            

//Create arrays with data
cData2[] vec1 = new cData2[num];

//Copy data to device
vec1_device.CopyToDevice(vec1);

//Set grid and block dimensions                       
addWithCuda.GridDimensions = new dim3(8, 1, 1);
addWithCuda.BlockDimensions = new dim3(512, 1, 1);

//Run the kernel
addWithCuda.Run(
    vec1_device.DevicePointer, 
    vec2_device.DevicePointer, 
    vec3_device.DevicePointer);

//Copy data from device
vec1_device.CopyToHost(vec1);

Solution 2:

This has been commented on the nvidia list in the past:

http://forums.nvidia.com/index.php?showtopic=97729

it would be easy to use P/Invoke to use it in assemblies like so:

  [DllImport("nvcuda")]
  public static extern CUResult cuMemAlloc(ref CUdeviceptr dptr, uint bytesize);

Solution 3:

I guess Hybridizer, explained here as a blog post on Nvidia is also worth to mention. Here is its related GitHub repo it seems.