After getting fed up with debugging the GPU BCL code using printfs, I finally have NVIDIA Nsight working with Campy, at least partially.
One problem was that all the examples I wrote always executed the program in the directory containing the executable. So, Campy examples would always work, as if by magic. However, if I tried to debug a Campy program using Nsight, it would always fail on a call to the Mono Cecil method Resolve(). Nsight implements debugging using a client/server model, which is pretty much how all debuggers work. However, the server would not honor the working directory I specified. Instead, it would execute the assembly from some directory other than where the test program resided. As it turns out, Mono Cecil requires an assembly resolver so that Resolve() can find dependent assemblies. Adding code for a resolver finally fixed the problem of debugging Campy using Nsight.
A second problem was that Nsight didn’t understand the debugging information in the PTX files generated when I compile the BCL CUDA source. I partially fixed this, so I can at least set breakpoints and step through kernels, by changing the NVCC compiler options to interleave source in the PTX (--source-in-ptx), to not generate GPU debug information (no -G), and to generate line number information (-lineinfo). The other options I use are --keep, -rdc=true, --compile, -machine 64, -cudart static, compute_35,sm_35. I tried various options in cuModuleLoadDataEx with PTX files produced with -G, but to no avail. However, there could be a problem with my CUDA C# library, Swigged.CUDA, where the option values may not be passed correctly.
Third, CUDA programs execute with a small runtime stack per thread, so allocating local buffers, such as a char buf array inside a device function, blows the stack very quickly.
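For reference, here is a minimal sketch of one knob that bears on the small stack: the CUDA runtime lets a program query and raise the per-thread stack limit with cudaDeviceGetLimit and cudaDeviceSetLimit. This is not what Campy does. The kernel, the helper names (check, ensure_stack_limit, use_local_buffer), and the 2048-byte buffer size are all made up for illustration, and whether raising the limit is needed at all depends on how the kernels are compiled and called.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical buffer size, chosen only for illustration.
constexpr int kBufBytes = 2048;

// Each thread keeps a scratch buffer in its own local memory (its "stack").
__global__ void use_local_buffer(int *out)
{
    char buf[kBufBytes];
    for (int i = 0; i < kBufBytes; ++i)
        buf[i] = static_cast<char>(i + threadIdx.x);
    // Read back at a thread-dependent index so the buffer is actually used.
    out[threadIdx.x] = buf[(threadIdx.x * 37) % kBufBytes];
}

// Raise the per-thread stack limit to at least `bytes` and return the
// limit that is in effect afterwards.
static size_t ensure_stack_limit(size_t bytes)
{
    size_t current = 0;
    check(cudaDeviceGetLimit(&current, cudaLimitStackSize), "cudaDeviceGetLimit");
    if (current < bytes) {
        check(cudaDeviceSetLimit(cudaLimitStackSize, bytes), "cudaDeviceSetLimit");
        check(cudaDeviceGetLimit(&current, cudaLimitStackSize), "cudaDeviceGetLimit");
    }
    return current;
}

int main()
{
    // Ask for room for the scratch buffer plus some headroom.
    size_t limit = ensure_stack_limit(2 * kBufBytes);
    std::printf("per-thread stack limit: %zu bytes\n", limit);

    int *out = nullptr;
    check(cudaMalloc(&out, 32 * sizeof(int)), "cudaMalloc");
    use_local_buffer<<<1, 32>>>(out);
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
    check(cudaFree(out), "cudaFree");
    return 0;
}
```

The driver API counterpart is cuCtxSetLimit with CU_LIMIT_STACK_SIZE. The other obvious route is to keep large buffers off the stack entirely, for example by passing in a buffer allocated in global memory.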
Although the GPU BCL type system is getting closer to working, it still doesn’t. More hacking required.