Caught between a rock and a hard place

The BCL for Campy is slowly being integrated and tested. Unfortunately, I just found out that the NVIDIA CUDA 9 GPU Toolkit does not support Kepler GPUs, which is what my 5-year-old laptop has, even though it is still a fine machine. On the other hand, the CUDA 8 GPU Toolkit does not support Visual Studio 2017. When I install the older Visual Studio 2015 and try to build the BCL, I find out that it cannot compile functions that use an ellipsis:

__device__ void Crash(char *pMsg, ...) {}

error : a __device__ or __host__ __device__ function cannot have ellipsis

Removing the ellipsis syntax from the BCL source would require a lot of changes which, in the long run, do nothing for the BCL except make it less readable. It is likely that fewer people will be using Kepler cards (e.g., K80) as they move on to Pascal GPUs (e.g., P100). Therefore, Campy is going to require Maxwell or newer GPUs and Visual Studio 2017.
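To give a sense of what removing the ellipsis would involve, here is a minimal sketch. It is not code from the BCL: the names Crash0, Crash1 and Crash2 are hypothetical, I am assuming only integer arguments, and the empty __device__ macro is there just so the sketch also builds as plain C outside NVCC.

```c
#include <stdio.h>

/* Under NVCC, __device__ is a real qualifier; elsewhere, define it away
   so this sketch compiles as ordinary C. */
#ifndef __CUDACC__
#define __device__
#endif

/* Hypothetical fixed-arity stand-ins for the variadic
   Crash(char *pMsg, ...). Each one forwards to printf and returns
   whatever printf returns. */
__device__ int Crash0(const char *msg) {
    return printf("%s", msg);
}

__device__ int Crash1(const char *fmt, int a0) {
    return printf(fmt, a0);
}

__device__ int Crash2(const char *fmt, int a0, int a1) {
    return printf(fmt, a0, a1);
}
```

Every call site would then have to pick the right variant by hand, for example Crash1("bad index %d\n", i) in place of Crash("bad index %d\n", i), and each new argument count or type needs yet another function. That is the kind of churn I would rather avoid.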

Secondly, it looks like the code for the BCL type system isn’t working. To get instructions like ldstr and newobj working, a functioning reflection type system is needed on the GPU. DotNetAnywhere has 300 lines of C code to read an assembly (which I encode at the moment as a large byte array, avoiding fopen) and extract its metadata. Unfortunately, after what seemed like endless cycles of debugging the CUDA code using printf, I ran the code on the CPU and found that it doesn’t work because it is designed for a 32-bit address target, whereas Campy targets 64-bit programming. Not being an expert on the PE file format, I’ll need to take some time to fix this code. So much for free software.
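For illustration, here is a minimal sketch of reading a PE image from an in-memory byte array in a way that does not care about pointer width. This is not DotNetAnywhere's code, and I am assuming (without having confirmed it) that the kind of problem involved is pointers being stored in 32-bit fields. PeInfo, pe_probe and pe_at are hypothetical names. The offsets used are standard PE layout: e_lfanew at 0x3C, a 4-byte "PE\0\0" signature, a 20-byte COFF header, then the optional header magic (0x10b for PE32, 0x20b for PE32+).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read little-endian integers byte by byte, so neither alignment nor the
   host word size matters. */
static uint16_t read_u16(const unsigned char *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_u32(const unsigned char *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical record: it keeps a 32-bit OFFSET into the image, never a
   pointer squeezed into 32 bits, so it is valid on 32- and 64-bit targets. */
typedef struct {
    uint32_t pe_offset; /* offset of the "PE\0\0" signature */
    uint16_t magic;     /* 0x10b = PE32, 0x20b = PE32+ */
} PeInfo;

/* Returns 1 and fills *out on success, 0 if the buffer is not a PE image. */
static int pe_probe(const unsigned char *image, size_t size, PeInfo *out) {
    assert(image != NULL && out != NULL);
    if (size < 0x40 || image[0] != 'M' || image[1] != 'Z') return 0;
    uint32_t off = read_u32(image + 0x3C); /* e_lfanew */
    /* Need 4 (signature) + 20 (COFF header) + 2 (magic) bytes. */
    if (off > size || size - off < 26) return 0;
    if (memcmp(image + off, "PE\0\0", 4) != 0) return 0;
    out->pe_offset = off;
    out->magic = read_u16(image + off + 24);
    return 1;
}

/* Turn an offset into a pointer only at the point of use. */
static const unsigned char *pe_at(const unsigned char *image, uint32_t offset) {
    return image + offset;
}
```

The point of the design is that everything stored is an offset relative to the start of the byte array, and a real pointer is only formed when needed, so nothing depends on a pointer fitting in 32 bits.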

A .NET GPU Base Class Library

In my last post, I mentioned that Campy was able to JIT quite a bit, but failed to JIT many kernels because a .NET base class library for the GPU was required. I’m happy to say this is now corrected. The DotNetAnywhere runtime has been ported to the NVIDIA GPU, and the simple string search example I mentioned in the last blog post now works.

The GPU BCL consists of 13 thousand lines of CUDA code in 41 .CU files, plus 24 thousand lines of C# code. NVCC compiles the CUDA code into 56 thousand lines of PTX, which is loaded when executing a kernel. When a kernel is run, Campy rewrites the kernel to use the GPU BCL. I still haven’t gotten over first seeing the string search kernel compile and run with all of this baggage... uh, runtime!

While this is an important step, there is much work left to do, not the least of which is a parallel memory allocator/garbage collector that will work on the GPU.