News

Upcoming Releases

The next release or two of Campy will be hammered out over the following weeks.

One will deal with the implementation of C# generics, which regressed a few months ago after the move to GPU BCL reference-type allocation. The original implementation never worked very well and was a kludge, so it needs to be rewritten. Further, much of the BCL uses generics (e.g., System.Console.WriteLine()), so this must be sorted out as soon as possible.

The other will deal with Campy on Ubuntu. There is no good reason why Campy cannot run on Ubuntu, so that will also be fixed. There is already a build of Swigged.LLVM for Ubuntu, and a build of Swigged.CUDA for Ubuntu will follow shortly. I’ll also need to get Campy’s GPU BCL to compile on Ubuntu, but that should be no harder than the previously mentioned libraries.

I’m not sure which feature will come first, but generally speaking, a new version of Campy should be available every few weeks. Work so far for these releases includes:

  • Support of enum types (13).
  • Performance improvement in basic block discovery of kernel code (77cee89).
  • Fix to GPU BCL type system initialization (14).
  • Separating the build of the runtime from the compiler so that the runtime can be built for Linux, and adding a Linux build. I am looking into a number of ways to do the build, including the Linux C++ build feature in Visual Studio.
  • Rewriting the compiler so that the phases are chained method calls, and renaming the phases to indicate what each does.


Release 0.0.8

Campy version 0.0.8 has been released. The changes to Campy since release 0.0.7 have centered around the integration of the GPU BCL (i.e., the “Dot Net Anywhere” runtime that is being used on the GPU after porting it to CUDA C) into the compiler. Unfortunately, the effort set into motion a large number of changes. Some of those changes were expected; many were not.

  • Until now, C# objects on the GPU were allocated using a “malloc” of pinned memory. This memory was allocated in C# on the host CPU using Cuda.cuMemHostAlloc(), and is accessible on both the CPU and GPU. But C# objects are managed, meaning that the BCL should know the type of an object when a pointer to it is passed around. With the recent changes to Campy, C# objects accessible on the GPU are now allocated using the GPU BCL (987209c and others).
  • The GPU BCL needs to be accessible on both the GPU and CPU, because memory allocations made on the CPU need to be recorded by the GPU BCL. Considerable time was devoted to figuring out how to write C# code that calls unmanaged C code in a DLL containing the GPU BCL (example; see also the sketch after this list). For the GPU, a static .LIB file containing pre-linked code is generated (via nvcc -dlink). For the CPU, an unmanaged layer written in C/C++ is provided in a DLL, and C# code calls the DLL API using P/Invoke.
  • The assembly containing the kernel needs to be loaded by the GPU BCL. Campy “worked” before, but it read the metadata only on the CPU side (using Mono.Cecil). The GPU BCL now reads the metadata for any referenced assemblies.
  • Even though Campy is supposed to be Net Standard 2.0 code, “dotnet build” of Campy would not build. As it turned out, Swigged.LLVM and Swigged.CUDA contained references to native libraries that prevented building Campy via Dotnet.exe. Those packages have been updated so that the native libraries are now in the proper sub-directory (Swigged.LLVM 6.0.5; Swigged.CUDA 9.185.5).
  • The pre-build code in the .TARGETS files of Swigged.LLVM and Swigged.CUDA does not work with “dotnet build”, because Dotnet does not create output directories before running the pre-build steps. The build now performs the copy using a completely different MSBuild mechanism.
  • This release fixes line-oriented debugging of kernel code (11). Due to a quirk of Mono Cecil (2116ef7), method references in CIL call instructions would not have debugging symbols loaded. There was also a problem in instruction discovery (IMPORTER) with the CIL call rewrite: the offset of the rewritten instruction was not set. These problems are now fixed.
  • Many compiler warnings were cleaned up. A Dotnet build of Campy is now completely error- and warning-free.
  • Note that Dotnet runs the application from a different directory than the one it is built in. In order for all dependent DLLs and LIBs to be found, you will need to change directory to the application, or “publish -r win10-x64” the application. Finding DLLs is still a mess, but Campy does work with both Net Core and Net Framework.
  • Nsight does not work with Net Core apps. I have no idea why Nsight is so messed up. Build the application as a Net Framework app, and it will all work as expected. Make sure you are building a 64-bit app; Campy only works with 64-bit apps.
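
For reference, the CPU-side call into the unmanaged layer mentioned above follows the usual P/Invoke pattern. The sketch below is only illustrative: the DLL name and the exported functions are hypothetical stand-ins for the actual GPU BCL entry points (see the example link above for the real code).

using System;
using System.Runtime.InteropServices;

internal static class GpuBclNative
{
    // Hypothetical exports standing in for the real GPU BCL entry points.
    // On the CPU, C# calls them through P/Invoke into the unmanaged DLL;
    // the same C code, pre-linked with "nvcc -dlink" into a static .LIB,
    // is what runs on the GPU.
    [DllImport("campy-runtime-native", CallingConvention = CallingConvention.Cdecl)]
    internal static extern void BclInitialize();

    [DllImport("campy-runtime-native", CallingConvention = CallingConvention.Cdecl)]
    internal static extern IntPtr BclAllocObject(IntPtr typeHandle);
}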

The release is in Nuget.org. You will need Net Core 2.0 and CUDA GPU Toolkit 9.1.45 installed on Windows. To take full advantage of Campy, e.g., debugging with Nsight, you will need Visual Studio 2017 (version 15). You should be able to use the latest version of Visual Studio, although I haven’t tried, because the GPU toolkit compiles C++ with VS version 15.4. Dotnet-published Net Core 2.0 apps should run with only the GPU Toolkit installed.

Release 0.0.7

Release 0.0.7 of Campy fixes multidimensional arrays and adds simple line/column debugging information to the generated code.

  • Implemented line debugging of kernel code (4c15bda, b7). Note that there are still bugs in the implementation: it works only for straight-line code, with no branching (see bug entry). This will be fixed in the next release. Also, you will probably need to use the “Start CUDA debugging (legacy)” menu command of the NVIDIA Nsight debugger version 5.5; the “Next Gen” debugger works only in TCC mode. I look forward to NVIDIA allowing combined CPU/GPU debugging in the future. Make sure to follow the instructions for Nsight, and set breakpoints in your C# kernel code before you start.
  • Fixed 2D arrays (1284d27).
  • Fixed “ceq” instruction code generation (75b990d, b9). In certain situations, the compiler would generate incorrect code.

The release is in Nuget.org. You will need VS2017 15.4.5 and CUDA GPU Toolkit 9.1.45 installed on Windows. However, once your application is developed, you only need the CUDA GPU Toolkit installed on the system that has the GPU card.

Release 0.0.6

Release 0.0.6 of Campy improves correctness and stability.

  • Implemented “ref” parameters for methods (bb4062f)
  • Corrected the semantics of Campy.Sequential.For() (9f14e66)
  • Fixed GCHandle.Free() of an uninitialized handle (fba9328)
  • Fixed API for memory management (4ede1e1 and others)
  • Added several examples for sorting to the unit tests (Comb, Bitonic, Even/Odd)

The release is in Nuget.org. You will need VS2017 15.4.5 and CUDA GPU Toolkit 9.1.45 installed on Windows.


Release 0.0.5

I have just released a new version of Campy. This version incorporates the new GPU Base Class Layer, and supports value and reference types. It does not handle generics (I broke that feature in my reorganization), nor does it handle ref parameters. To try out Campy, you must use Visual Studio 2017 15.4, an NVIDIA Maxwell or newer GPU, and CUDA GPU Toolkit 9.1.85. I do not check these prerequisites in code, so please make sure you have them. Create a NET Core 2.0 application and add Campy from NuGet.org. For examples of Campy, see the Campy tests directory, such as Reduction; a minimal example is also sketched below. Undoubtedly there are bugs, and it is somewhat slow due to JIT compilation at the time of the Parallel.For() call. You can improve the apparent JIT speed by “warming up” the GPU with the kernel code first, since the JITed object code is cached for subsequent calls. –Ken Domino
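
For a feel of the API, here is a minimal sketch of the kind of code Campy compiles. The Campy.Parallel.For(count, lambda) signature is assumed from the tests directory; the lambda captures a C# array, is JIT-compiled to PTX at the call, and runs on the GPU.

using System;

class Program
{
    static void Main()
    {
        int n = 16;
        int[] squares = new int[n];

        // JIT-compiled to PTX at the point of the call and executed on the GPU.
        // Running a kernel once "warms up" the GPU; the object code is cached
        // for subsequent calls.
        Campy.Parallel.For(n, i => squares[i] = i * i);

        Console.WriteLine(String.Join(" ", squares));
    }
}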

Status

Here is the latest on Campy:

  • The base class layer for the GPU has been fixed so that it no longer relies on free-standing global variables. C++ global variables are initialized when Campy JITs a kernel, so in order to avoid re-initializing the base class layer, all but one of the C++ global variables are now placed in a single structure. The runtime is now initialized only once, a critical performance improvement.
  • An API for managing GPU/CPU memory synchronization has been added. Previously, after each Parallel.For() call, memory was copied back to the C# data structures on the CPU. With an explicit API to control when memory is copied, certain algorithms that nest Parallel.For() calls inside another loop, e.g., FFT and Reduction, are now much faster (see the sketch after this list).
  • I have spent a lot of time making changes to Swigged.LLVM, which Campy uses. Swigged.LLVM is now built fully automatically on Appveyor, and it has been updated to the latest release of LLVM, version 6.0.
  • I will be making a release of Campy in the next month if all goes well.
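
To illustrate the usage pattern of the synchronization API mentioned above (the method names below are assumptions on my part, not necessarily the final API), the idea is to keep the data resident on the GPU across the nested Parallel.For() calls and copy it back to the CPU once at the end:

using System;

class ReductionSketch
{
    static void Main()
    {
        int n = 1024;
        int[] data = new int[n];
        for (int i = 0; i < n; ++i) data[i] = 1;

        // Hypothetical calls standing in for the explicit memory-synchronization
        // API: mark the array as resident on the GPU, run the nested
        // Parallel.For() reduction, then copy back once at the end.
        Campy.Parallel.Sticky(data);
        for (int stride = n / 2; stride > 0; stride /= 2)
        {
            int s = stride;
            Campy.Parallel.For(s, i => data[i] += data[i + s]);
        }
        Campy.Parallel.Sync();

        Console.WriteLine(data[0]); // prints 1024
    }
}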

Global variables in CUDA

Campy JITs a C#/NET lambda expression into NVIDIA PTX, which is then linked with the GPU base class library runtime, loaded into a cuModule, and executed. Within a single module, C++ global variables are fine: they are accessible and stable between calls to different kernels via Cuda.cuLaunchKernel(). Unfortunately, global variables are not shared between cuModules. As noted in the CUDA C Programming Guide, “The names for all symbols, including functions, global variables, and texture or surface references, are maintained at module scope so that modules written by independent third parties may interoperate in the same CUDA context.” Further, “Any user defined __device__, __constant__ or __managed__ device global variables present in the module that owns the CUfunction being launched are independently instantiated on every device. The user is responsible for initializing such device global variables appropriately.”

At the moment, each call to Parallel.For() results in a separate JIT and a separate CUDA module, so the runtime is re-initialized with each Parallel.For(). Unfortunately, initialization of the base class layer is quite time-consuming, especially on the GPU, so it should be done only once, either on the CPU or the GPU. Therefore, the C++ global variables need to be eliminated. But the GPU base class layer library contains dozens of global variables and hundreds of references to them. That means another mass edit of the DotNetAnywhere code, and another bump on the road to full C# support on the GPU.

CUDA and kernel malloc’s

If you have ever tried to call malloc() in a CUDA kernel, you might be surprised to find that it seems to “work.” You get a buffer allocated, and you can assign values to it. However, if you thought that the buffer could then be copied to the host CPU via cuMemcpyDtoH, or accessed directly from the CPU, guess again: cuMemcpyDtoH returns cudaErrorInvalidValue (0x0000000b), and you get a segv if you try to access the buffer directly from the CPU. While some folks have come to the same conclusion [1, 2], you may find others saying that you can, without backing up the claim with evidence [3]. My own test, with the code below, demonstrates that one cannot access kernel malloc() buffers on the CPU.

Since malloc() buffers can’t be accessed on the host, Campy implements its own memory allocator. A heap is allocated on the CPU, and allocations made on the GPU are carved out of it. This allocator is quite primitive; at some point, I hope to write a better one, such as the allocator by Huang [4].
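
The allocator is little more than a bump pointer over that heap. Here is a minimal sketch of the idea in C# (not Campy’s actual code); it assumes the heap has already been allocated in memory visible to both the CPU and GPU, e.g., via Cuda.cuMemHostAlloc().

using System;
using System.Threading;

// A toy bump allocator over a pre-allocated heap.  Allocation just advances
// an offset atomically; there is no free(), which is one reason the real
// allocator eventually needs to be replaced with something better.
class BumpHeap
{
    private readonly IntPtr _base;
    private readonly long _size;
    private long _next;

    public BumpHeap(IntPtr baseAddress, long size)
    {
        _base = baseAddress;
        _size = size;
    }

    public IntPtr Allocate(long bytes)
    {
        long aligned = (bytes + 7) & ~7L;                       // 8-byte alignment
        long end = Interlocked.Add(ref _next, aligned);         // reserve a chunk
        if (end > _size)
            throw new OutOfMemoryException("heap exhausted");
        return new IntPtr(_base.ToInt64() + (end - aligned));   // start of the chunk
    }
}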

[1] https://stackoverflow.com/questions/42000339/use-data-allocated-dynamically-in-cuda-kernel-on-host

[2] https://stackoverflow.com/questions/13480213/how-to-dynamically-allocate-arrays-inside-a-kernel

[3] http://heather.cs.ucdavis.edu/~matloff/158/PLN/CUDA.tex

[4] Huang, X. Xmalloc: a scalable lock-free dynamic memory allocator for many-core machines. MS Thesis, the University of Illinois at Urbana-Champaign. 2010. https://www.ideals.illinois.edu/bitstream/handle/2142/16137/Huang_Xiaohuang.pdf?sequence=1&isAllowed=y

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>

cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size);

__global__ void addKernel(int *c, const int *a, const int *b, void **d)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
    // Allocate a buffer from the device heap with in-kernel malloc(), write
    // to it, and record the device pointer so the host can try to read it.
    int *ddd = (int*)malloc(sizeof(int));
    *ddd = i;
    d[i] = ddd;
}

int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    // Add vectors in parallel.
    cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }

    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
        c[0], c[1], c[2], c[3], c[4]);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    void **dev_d = 0;
    // Host-side copy of the device pointers written by the kernel.  Declared
    // here (not at the point of use) so that the "goto Error" cleanup paths
    // do not jump over an initialized variable.
    void **d = NULL;
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU buffers for three vectors (two input, one output)    .
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

	cudaStatus = cudaMalloc((void**)&dev_d, size * sizeof(void*));
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMalloc failed!");
		goto Error;
	}


    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, size>>>(dev_c, dev_a, dev_b, dev_d);

    // Check for any errors launching the kernel
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }
    
    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }

    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Copy the array of device pointers written by the kernel back to the host.
    d = (void**)malloc(sizeof(void*) * size);
    cudaStatus = cudaMemcpy(d, dev_d, size * sizeof(void*), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Attempt to copy from a buffer that the kernel allocated with malloc().
    // This is the point of the experiment: the copy fails with
    // cudaErrorInvalidValue, because pointers returned by in-kernel malloc()
    // refer to the device heap, which cudaMemcpy cannot read on the host.
    int jjj;
    cudaStatus = cudaMemcpy(&jjj, d[0], sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy from kernel malloc'd buffer failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    printf("value read from kernel malloc'd buffer: %d\n", jjj);

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_d);
    free(d);

    return cudaStatus;
}


Progress with a type system

After getting fed up with debugging the GPU BCL code using printf’s, I finally have NVIDIA Nsight working with Campy, at least partially.

One problem was that all the examples I wrote always executed the program from the directory containing the executable, so Campy examples would always work, as if by magic. However, if I tried to debug a Campy program using Nsight, it would always fail on a call to the Mono Cecil method Resolve(). Nsight implements debugging using a client/server model, which is pretty much how all debuggers work. However, the server would not honor executing the program in the location specified; instead, it would execute the assembly from some directory other than where the test program resided. As it turns out, Mono Cecil requires an assembly resolver so that Resolve() can find dependent assemblies. Adding code for a resolver finally fixed the problem of debugging Campy under Nsight.
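
The fix amounts to handing Mono Cecil a resolver that knows where the test program’s assemblies actually live. A sketch of the idea (not Campy’s exact code):

using System.IO;
using Mono.Cecil;

static class AssemblyLoader
{
    public static AssemblyDefinition Load(string assemblyPath)
    {
        // Tell Cecil where to look for dependent assemblies, so that Resolve()
        // works even when the debugger launches the process from a directory
        // other than the one containing the test program.
        var resolver = new DefaultAssemblyResolver();
        resolver.AddSearchDirectory(Path.GetDirectoryName(assemblyPath));

        var parameters = new ReaderParameters { AssemblyResolver = resolver };
        return AssemblyDefinition.ReadAssembly(assemblyPath, parameters);
    }
}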

A second problem was that Nsight didn’t understand the debugging information in the PTX files generated when I compile the BCL CUDA source. I partially fixed this, so I can at least set breakpoints and step through kernels, by changing the NVCC compiler options to interleave source in the PTX (--source-in-ptx), not generate GPU debug information (no -G), and generate line-number information (-lineinfo). The other options I use are --keep, -rdc=true, --compile, -machine 64, -cudart static, and compute_35,sm_35. I tried various options to cuModuleLoadDataEx with PTX files produced with -G, but to no avail. There could also be a problem with my CUDA C# library Swigged.CUDA, where the option values may not be passed correctly.

Third, CUDA programs execute with a small runtime stack size, so allocating automatics like char buf[2000]; blows the stack very quickly.

Although the GPU BCL type system is getting closer to working, it still doesn’t. More hacking required.

–Ken Domino