Status

Here is the latest on Campy:

  • The base class layer for the GPU has been reworked so that it no longer depends on C++ global variables. Globals are re-initialized every time Campy JITs a kernel, so to avoid re-initializing the base class layer, all but one of the C++ globals are now placed in a single structure. The runtime is now initialized only once, a critical performance improvement.
  • An API for managing GPU/CPU memory synchronization has been added. Previously, after each Parallel.For() call, memory was copied back to C# data structures on the CPU. With an explicit API to control when memory is synchronized, certain algorithms that nest Parallel.For() calls inside another loop, e.g., FFT and reduction, are now much faster.
  • I have spent a lot of time making changes to Swigged.LLVM, which Campy uses. Swigged.LLVM is now built fully automatically on AppVeyor, and it has been updated to the latest release of LLVM, version 6.0.
  • I will be making a release of Campy in the next month if all goes well.

Global variables in CUDA

Campy JITs a C#/NET lambda expression into NVIDIA PTX, which is then linked with the GPU base class library runtime, loaded into a cuModule, and executed. Within a single module, C++ global variables are fine: they are accessible and stable between calls to different kernels via Cuda.cuLaunchKernel(). Unfortunately, global variables are not shared between cuModules. As noted in the CUDA C Programming Guide, “The names for all symbols, including functions, global variables, and texture or surface references, are maintained at module scope so that modules written by independent third parties may interoperate in the same CUDA context.” Further, “Any user defined __device__, __constant__ or __managed__ device global variables present in the module that owns the CUfunction being launched are independently instantiated on every device. The user is responsible for initializing such device global variables appropriately.”

At the moment, each call to Parallel.For() results in a separate JIT and CUDA module, so the runtime is re-initialized with each Parallel.For(). Unfortunately, initialization of the base class layer is quite time-consuming, especially on the GPU, so it should be done only once, either on the CPU or the GPU. Therefore, the C++ global variables need to be eliminated. But the GPU base class layer library contains dozens of global variables and hundreds of references to them: another mass edit of the DotNetAnywhere code, and another bump on the road to full C# support on the GPU.
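The refactoring described above can be sketched in plain C++. This is a hypothetical miniature, not Campy's actual code: the many file-scope globals are gathered into one state structure, and a single guard ensures the expensive setup runs exactly once.

```cpp
// Hypothetical sketch (not Campy's actual code) of the "globals in one
// structure" pattern: instead of dozens of file-scope globals that get
// re-initialized with every JIT, the runtime keeps one state struct and
// runs the expensive initialization exactly once.
struct BclState {
    bool initialized;
    long heap_bytes;
    int  init_count;   // counts how many times the expensive path really ran
};

static BclState g_bcl = { false, 0, 0 };  // the one remaining global

void bcl_init() {
    if (g_bcl.initialized) return;   // already set up: skip the expensive work
    // ... expensive base-class-layer setup would go here ...
    g_bcl.heap_bytes = 1 << 20;
    g_bcl.init_count += 1;
    g_bcl.initialized = true;
}
```

Calling bcl_init() from every kernel launch is then harmless; only the first call pays the cost.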

CUDA and kernel mallocs

If you have ever tried to call malloc() in a CUDA kernel, you might be surprised to find that it seems to “work.” You get a buffer allocated, and you can assign values to it. However, if you thought that the buffer could then be accessed on the host CPU via cuMemcpyDtoH, or accessed directly from the CPU, guess again: cuMemcpyDtoH returns cudaErrorInvalidValue (0x0000000b), and you get a segfault if you try to dereference the buffer directly on the CPU. While some folks have come to the same conclusion [1, 2], you may find others claiming that it works, though without evidence to back up the claim [3]. My own test, with the code below, demonstrates that one cannot access kernel malloc() buffers on the CPU.

Since malloc() buffers can’t be accessed on the host, Campy implements its own memory allocator: a heap is allocated from the CPU, and the GPU allocates objects out of it. The allocator is quite primitive; at some point, I hope to replace it with a better one, such as that of Huang [4].
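A minimal sketch of that kind of allocator, under my own simplifying assumptions and not Campy's actual implementation: the host allocates one big heap up front (here a plain array stands in for a buffer that would come from cuMemAlloc), and allocation just atomically bumps an offset into it. Because the host owns the underlying buffer, everything carved out of it remains visible to the CPU.

```cpp
#include <atomic>
#include <cstdint>
#include <cstddef>

// Bump allocator over a pre-allocated heap (sketch). Threads claim space
// by atomically advancing an offset; the host can read any allocation
// because it allocated the backing buffer itself.
struct BumpHeap {
    uint8_t *base;
    size_t capacity;
    std::atomic<size_t> next;

    BumpHeap(uint8_t *b, size_t cap) : base(b), capacity(cap), next(0) {}

    void *alloc(size_t bytes) {
        size_t rounded = (bytes + 7) & ~size_t(7);     // keep 8-byte alignment
        size_t old = next.fetch_add(rounded);          // claim a slice atomically
        if (old + rounded > capacity) return nullptr;  // heap exhausted
        return base + old;
    }
};
```

A real GPU version would add free lists and per-block arenas to reduce contention, which is roughly what Huang's Xmalloc [4] does.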

[1] https://stackoverflow.com/questions/42000339/use-data-allocated-dynamically-in-cuda-kernel-on-host

[2] https://stackoverflow.com/questions/13480213/how-to-dynamically-allocate-arrays-inside-a-kernel

[3] http://heather.cs.ucdavis.edu/~matloff/158/PLN/CUDA.tex

[4] Huang, X. Xmalloc: a scalable lock-free dynamic memory allocator for many-core machines. MS Thesis, the University of Illinois at Urbana-Champaign. 2010. https://www.ideals.illinois.edu/bitstream/handle/2142/16137/Huang_Xiaohuang.pdf?sequence=1&isAllowed=y

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>

cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size);

__global__ void addKernel(int *c, const int *a, const int *b, void **d)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
    // Allocate a buffer with in-kernel malloc() and record its address.
    int *ddd = (int *)malloc(sizeof(int));
    d[i] = ddd;
    *ddd = i;
}

int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    // Add vectors in parallel.
    cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }

    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
        c[0], c[1], c[2], c[3], c[4]);

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    void **dev_d = 0;
    void **d = 0;
    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU buffers for three vectors (two input, one output).
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Allocate a buffer to hold the device pointers recorded by the kernel.
    cudaStatus = cudaMalloc((void**)&dev_d, size * sizeof(void*));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, size>>>(dev_c, dev_a, dev_b, dev_d);

    // Check for any errors launching the kernel.
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }

    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Copying the array of device pointers themselves works fine...
    d = (void**)malloc(sizeof(void*) * size);
    cudaStatus = cudaMemcpy(d, dev_d, size * sizeof(void*), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // ...but copying out of a buffer the kernel allocated with malloc()
    // fails: this cudaMemcpy returns cudaErrorInvalidValue.
    int jjj;
    cudaStatus = cudaMemcpy(&jjj, d[0], sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy from a kernel malloc() buffer failed!");
        goto Error;
    }

Error:
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_d);
    free(d);

    return cudaStatus;
}


Progress with a type system

After getting fed up with debugging the GPU BCL code using printfs, I finally have NVIDIA Nsight working with Campy, at least partially.

One problem was that all the examples I wrote executed the program from the directory containing the executable, so Campy examples would always work, as if by magic. However, if I tried to debug a Campy program using Nsight, it would always fail on a call to the Mono Cecil method Resolve(). Nsight implements debugging using a client/server model, which is pretty much how all debuggers work. However, the server would not honor executing the program in the location specified; instead, it would execute the assembly from some directory other than where the test program resided. As it turns out, Mono Cecil requires an assembly resolver so that Resolve() can find dependent assemblies. Adding code for a resolver finally fixed the problem of debugging Campy with Nsight.

A second problem was that Nsight didn’t understand the debugging information in the PTX files generated when I compile the BCL CUDA source. I partially fixed this, so I can at least set breakpoints and step through kernels, by changing the NVCC compiler options to interleave source in the PTX (--source-in-ptx), not generate GPU debug information (no -G), and generate line-number information (-lineinfo). The other options I use are --keep, -rdc=true, --compile, --machine 64, -cudart static, and compute_35,sm_35. I tried various options in cuModuleLoadDataEx with PTX files produced with -G, but to no avail. There could also be a problem with my CUDA C# library Swigged.CUDA, where the option values may not be passed correctly.

Third, CUDA programs execute with a small default runtime stack size, so allocating automatics like char buf[2000]; blows the stack very quickly. (The limit can be raised with cudaDeviceSetLimit(cudaLimitStackSize, ...), at the cost of per-thread memory.)

Although the GPU BCL type system is getting closer to working, it still doesn’t. More hacking required.

–Ken Domino

Caught between a rock and a hard place

The BCL for Campy is, slowly, being integrated and tested. Unfortunately, I just found out that the NVIDIA CUDA 9 GPU Toolkit does not support Kepler GPUs, which is what my five-year-old laptop has, even though it is still a fine machine. But the CUDA 8 GPU Toolkit does not support Visual Studio 2017. And when I install the older Visual Studio 2015 and try to build the BCL, I find that it cannot compile functions that use an ellipsis:

__device__ void Crash(char *pMsg, ...) {}

error : a __device__ or __host__ __device__ function cannot have ellipsis

Removing the ellipsis syntax from the BCL source would require a lot of changes, which, in the long run, would do nothing for the BCL except make it less readable. It is likely fewer people will be using Kepler cards (e.g., K80) as people move on to Pascal GPUs (e.g., P100). Therefore, Campy is going to require Maxwell or newer GPUs and Visual Studio 2017.

Secondly, it looks like the code for the BCL type system isn’t working. To get instructions like ldstr and newobj working, a functioning reflection type system is needed on the GPU. DotNetAnywhere has 300 lines of C code to read an assembly (which I encode at the moment as a large byte array, avoiding fopen) and extract its metadata. Unfortunately, after what seemed like endless cycles of debugging the CUDA code using printf, I ran the code on the CPU and found that it doesn’t work because it was designed for a 32-bit address target, whereas Campy targets 64-bit programming. Not being an expert on the PE file format, I’ll need to take some time to fix this code. So much for free software.

A NET GPU Base Class Library

In my last post, I mentioned that Campy was able to JIT quite a bit, but failed to JIT many kernels because a NET base class library for the GPU was required. I’m happy to say this is now corrected. The DotNetAnywhere runtime has been ported to the NVIDIA GPU, and the simple string search example I mentioned in the last blog post now works.

The GPU BCL consists of 13 thousand lines of CUDA code (in 41 .CU files, which NVCC compiles into 56 thousand lines of PTX that are loaded when executing a kernel) and 24 thousand lines of C# code. When a kernel is run, Campy rewrites the kernel to use the GPU BCL. I still haven’t gotten over first seeing the string search kernel compile and run with all of this baggage, er, runtime!

While an important step, there is much work to do, not the least of which is a parallel memory allocator/garbage collector that will work on the GPU.

Ken

I’ve a feeling we’re not in Kansas anymore

Now that Campy can compile simple value and reference types, the next issue is the runtime. Even on simple examples, when Campy decompiles a kernel in order to JIT it for the GPU, it quickly discovers that the kernel references the NET runtime, as most C# code does.

In all NET runtimes, most of the source is written in C#, which Campy can JIT, but a small portion of the Base Class Library (see ECMA-334, Appendix D) is written in C, and therefore cannot be JIT’ed by Campy. Furthermore, depending on how a program that uses Campy is run, Campy decompiles different runtimes for the same kernel source code.

Let’s take a simple string search example, which finds all occurrences of a pattern string in a text string using a brute-force algorithm.

string text = ...;
string pattern = ...;
int[] found = ...;
Campy.Parallel.For(n, i =>
{
    int j;
    for (j = 0; j < m && i + j < n; j++)
        if (text[i + j] != pattern[j])
            break; // mismatch found, exit the inner loop
    if (j == m)
    {
        // match found
        found[i] = 1;
    }
});

When built with NET Framework 4.7, Campy discovers the kernel method and the BCL method String.get_Chars(). However, get_Chars() has no CIL body, as it is a native C function, which we also know because it is tagged “Is Internal” when examined using reflection.

If the program is run under Mono, Campy discovers a different implementation, entirely in CIL, which can be JIT’ed.

The result is a richly textured landscape of different runtimes that need to be accommodated. That said, much of the runtime just doesn’t make any sense on the GPU: networking, graphics drawing, etc.

After reading some of the source code for CoreCLR, CoreRT, Mono, the NET Micro Framework, and DotNetAnywhere, I’ve come to the conclusion that I cannot simply replace the calls to C functions of the runtime with unsafe C# code, which is what I have been doing so far. This cannot work because the C# and C sides of the BCL must agree on the layout of the underlying data structures.

Therefore, Campy must implement a BCL specific to it. Fortunately, there are plenty of examples to start from. Unfortunately, while CUDA C++ on Windows can compile C++11/14, it requires every device-callable function to be tagged with the __device__ qualifier, a rather annoying requirement.


Advancing with value and reference types, and onto a runtime

There has been significant progress on Campy. It is now starting to compile both value types and reference types, and small bits of the NET runtime. For value types, it compiles the usual ints, floats, doubles, etc., but structs as well. For methods, it compiles static and non-virtual functions. Support for the NET runtime is still next to nil because I cannot get Mono Cecil to decompile the bytecode of the NET runtime.

It hasn’t been easy getting to this stage. Despite years in compiler development, and many more years as a developer, I haven’t kept up with the latest tech. Most compiler writers now use LLVM, which I wasn’t at all familiar with, tending to write my own code for everything. The sparsity of examples showing how to use LLVM-C didn’t help. Fortunately, I am coming up to speed, and when I get some time to breathe, I plan on writing some very basic straight-line code that will function as both examples and unit tests of LLVM-C. It’s taken two months to write a thunking layer to get C# to talk to LLVM-C, a month to write a thunking layer to get C# to talk to the CUDA Driver API, and now three months of 12+ hours per day, 7 days per week, to get the compiler to translate CIL into NVIDIA PTX, copy data structures between C# and an internal representation, and run the GPU code.

Two important examples now work: reduction over integer addition and the Fast Fourier Transform. The latter uses System.Numerics.Complex, which is part of the NET runtime; Complex is a struct (i.e., a value type). This example cannot be compiled by any of the other C#/GPU compilers out there (Alea GPU, ILGPU). Note that the combination of the JIT compilation time and the deep data structure copying to and from pinned global memory makes for a very slow implementation of GPU algorithms at the moment. But I have plans to fix this.

Speaking of which, it turns out there is quite a bit of the NET runtime that must be rewritten as a drop-in replacement for certain types one uses. For example, you can index a string to get at its characters, but this calls a NET runtime method implemented as a native (internal call) function. There are basically two runtimes from which to choose: Mono or NET Core. At this point, I am not quite sure which will be best.

I’ve been making semi-regular releases of Campy for NET Framework 4.7 programs on Windows systems with NVIDIA GPUs. It still has a long way to go, but it’s very encouraging to see some rather complex examples working.

Given the progress in Campy, I’m planning on going to NVIDIA’s GPU Technology Conference in March 2018. In fact, I will be submitting a proposal for a presentation on Campy for the conference. I hope my proposal is accepted, and I have a chance to meet other C#/GPU developers at the conference.

Here is the code for reduction and the FFT for you to get an idea of what things are looking like.

Reduction

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using Campy;

namespace Reduction
{
    public class Bithacks
    {
        static bool preped;

        static int[] LogTable256 = new int[256];

        static void prep()
        {
            LogTable256[0] = LogTable256[1] = 0;
            for (int i = 2; i < 256; i++)
            {
                LogTable256[i] = 1 + LogTable256[i / 2];
            }
            LogTable256[0] = -1; // if you want log(0) to return -1

            // Prepare the reverse bits table.
            prep_reverse_bits();
        }

        public static int FloorLog2(uint v)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            int r; // r will be lg(v)
            uint tt; // temporaries

            if ((tt = v >> 24) != 0)
            {
                r = (24 + LogTable256[tt]);
            }
            else if ((tt = v >> 16) != 0)
            {
                r = (16 + LogTable256[tt]);
            }
            else if ((tt = v >> 8) != 0)
            {
                r = (8 + LogTable256[tt]);
            }
            else
            {
                r = LogTable256[v];
            }
            return r;
        }

        public static long FloorLog2(ulong v)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            long r; // r will be lg(v)
            ulong tt; // temporaries

            if ((tt = v >> 56) != 0)
            {
                r = (56 + LogTable256[tt]);
            }
            else if ((tt = v >> 48) != 0)
            {
                r = (48 + LogTable256[tt]);
            }
            else if ((tt = v >> 40) != 0)
            {
                r = (40 + LogTable256[tt]);
            }
            else if ((tt = v >> 32) != 0)
            {
                r = (32 + LogTable256[tt]);
            }
            else if ((tt = v >> 24) != 0)
            {
                r = (24 + LogTable256[tt]);
            }
            else if ((tt = v >> 16) != 0)
            {
                r = (16 + LogTable256[tt]);
            }
            else if ((tt = v >> 8) != 0)
            {
                r = (8 + LogTable256[tt]);
            }
            else
            {
                r = LogTable256[v];
            }
            return r;
        }

        public static int CeilingLog2(uint v)
        {
            int r = Bithacks.FloorLog2(v);
            if (r < 0)
                return r;
            if (v != (uint)Bithacks.Power2((uint)r))
                return r + 1;
            else
                return r;
        }

        public static int Power2(uint v)
        {
            if (v == 0)
                return 1;
            else
                return (int)(2 << (int)(v - 1));
        }

        public static int Power2(int v)
        {
            if (v == 0)
                return 1;
            else
                return (int)(2 << (int)(v - 1));
        }

        static byte[] BitReverseTable256 = new byte[256];

        static void R2(ref int i, byte v)
        {
            BitReverseTable256[i++] = v;
            BitReverseTable256[i++] = (byte)(v + 2 * 64);
            BitReverseTable256[i++] = (byte)(v + 1 * 64);
            BitReverseTable256[i++] = (byte)(v + 3 * 64);
        }

        static void R4(ref int i, byte v)
        {
            R2(ref i, v);
            R2(ref i, (byte)(v + 2 * 16));
            R2(ref i, (byte)(v + 1 * 16));
            R2(ref i, (byte)(v + 3 * 16));
        }

        static void R6(ref int i, byte v)
        {
            R4(ref i, v);
            R4(ref i, (byte)(v + 2 * 4));
            R4(ref i, (byte)(v + 1 * 4));
            R4(ref i, (byte)(v + 3 * 4));
        }

        static void prep_reverse_bits()
        {
            int i = 0;
            R6(ref i, 0);
            R6(ref i, 2);
            R6(ref i, 1);
            R6(ref i, 3);
        }

        public static byte ReverseBits(byte from)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            return BitReverseTable256[from];
        }

        public static Int32 ReverseBits(Int32 from)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            Int32 result = 0;
            for (int i = 0; i < sizeof(Int32); ++i)
            {
                result = result << 8;
                result |= BitReverseTable256[(byte)(from & 0xff)];
                from = from >> 8;
            }
            return result;
        }

        public static UInt32 ReverseBits(UInt32 from)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            UInt32 result = 0;
            for (int i = 0; i < sizeof(UInt32); ++i)
            {
                result = result << 8;
                result |= BitReverseTable256[(byte)(from & 0xff)];
                from = from >> 8;
            }
            return result;
        }

        static int Ones(uint x)
        {
            // 32-bit recursive reduction using SWAR...  but first step is mapping 2-bit values
            // into sum of 2 1-bit values in sneaky way
            x -= ((x >> 1) & 0x55555555);
            x = (((x >> 2) & 0x33333333) + (x & 0x33333333));
            x = (((x >> 4) + x) & 0x0f0f0f0f);
            x += (x >> 8);
            x += (x >> 16);
            return (int)(x & 0x0000003f);
        }

        public static int xFloorLog2(uint x)
        {
            x |= (x >> 1);
            x |= (x >> 2);
            x |= (x >> 4);
            x |= (x >> 8);
            x |= (x >> 16);
            return (Bithacks.Ones(x) - 1);
        }

        public static int Log2(uint x)
        {
            return FloorLog2(x);
        }

        public static int Log2(int x)
        {
            return FloorLog2((uint)x);
        }


    }

    [TestClass]
    public class Reduction
    {
        [TestMethod]
        public void ReductionT()
        {
            int n = Bithacks.Power2(10);
            int result_gpu = 0;
            int result_cpu = 0;
            {
                int[] data = new int[n];
                Campy.Parallel.For(n, idx => data[idx] = 1);
                for (int level = 1; level <= Bithacks.Log2(n); level++)
                {
                    int step = Bithacks.Power2(level);
                    Campy.Parallel.For(new Extent(n / step), idx =>
                    {
                        var i = step * idx;
                        data[i] = data[i] + data[i + step / 2];
                    });
                }
                result_gpu = data[0];
            }
            {
                int[] data = new int[n];
                for (int idx = 0; idx < n; ++idx) data[idx] = 1;
                for (int level = 1; level <= Bithacks.Log2(n); level++)
                {
                    int step = Bithacks.Power2(level);
                    for (int idx = 0; idx < n / step; idx++)
                    {
                        var i = step * idx;
                        data[i] = data[i] + data[i + step / 2];
                    }
                }
                result_cpu = data[0];
            }
            if (result_gpu != result_cpu) throw new Exception();
        }
    }
}

FFT

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using System.Linq;
using System.Numerics;

namespace FFT
{
    [TestClass]
    public class UnitTest1
    {
        /* Performs a bit reversal algorithm on a positive integer
         * for a given number of bits,
         * e.g., 011 with 3 bits is reversed to 110.
         */
        public static int BitReverse(int n, int bits)
        {
            int reversedN = n;
            int count = bits - 1;

            n >>= 1;
            while (n > 0)
            {
                reversedN = (reversedN << 1) | (n & 1);
                count--;
                n >>= 1;
            }

            return ((reversedN << count) & ((1 << bits) - 1));
        }

        /* Uses the Cooley-Tukey iterative in-place algorithm, radix-2 DIT case;
         * assumes the number of points provided is a power of 2. */
        public static void FFT(Complex[] buffer)
        {

            int bits = (int)Math.Log(buffer.Length, 2);
            for (int j = 1; j < buffer.Length / 2; j++)
            {
                int swapPos = BitReverse(j, bits);
                var temp = buffer[j];
                buffer[j] = buffer[swapPos];
                buffer[swapPos] = temp;
            }

            for (int N = 2; N <= buffer.Length; N <<= 1)
            {
                for (int i = 0; i < buffer.Length; i += N)
                {
                    for (int k = 0; k < N / 2; k++)
                    {
                        int evenIndex = i + k;
                        int oddIndex = i + k + (N / 2);
                        var even = buffer[evenIndex];
                        var odd = buffer[oddIndex];

                        double term = -2 * Math.PI * k / (double)N;
                        Complex exp = new Complex(Math.Cos(term), Math.Sin(term)) * odd;

                        buffer[evenIndex] = even + exp;
                        buffer[oddIndex] = even - exp;
                    }
                }
            }
        }

        public static void FFTGPU(Complex[] buffer)
        {
            int bits = (int)Math.Log(buffer.Length, 2);

            Campy.Parallel.For(buffer.Length / 2 - 1, k =>
            {
                int j = k + 1;
                int swapPos = BitReverse(j, bits);
                var temp = buffer[j];
                buffer[j] = buffer[swapPos];
                buffer[swapPos] = temp;
            });

            for (int N = 2; N <= buffer.Length; N <<= 1)
            {
                int step = N / 2;
                Campy.Parallel.For(buffer.Length / 2, d =>
                {
                    var k = d % step;
                    var t = d % step + N * (d / step);
                    int evenIndex = t;
                    int oddIndex = t + step;

                    var even = buffer[evenIndex];
                    var odd = buffer[oddIndex];

                    double term = -2 * Math.PI * k / (double)N;
                    Complex exp = new Complex(Math.Cos(term), Math.Sin(term)) * odd;

                    buffer[evenIndex] = even + exp;
                    buffer[oddIndex] = even - exp;
                });
            }
        }

        bool ApproxEqual(double a, double b)
        {
            if (b > a)
                return (b - a) < 0.01;
            else
                return (a - b) < 0.01;
        }

        [TestMethod]
        public void TestMethod1()
        {
            Complex[] input = { 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };
            var copy = input.ToArray();

            FFTGPU(input);
            FFT(copy);

            for (int i = 0; i < input.Length; ++i)
            {
                if (!ApproxEqual(copy[i].Real, input[i].Real)) throw new Exception();
                if (!ApproxEqual(copy[i].Imaginary, input[i].Imaginary)) throw new Exception();
            }
        }
    }
}
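The index flattening used in FFTGPU above (t = d % step + N * (d / step), pairing element t with t + step, where step = N/2) can be sanity-checked on the CPU with a short helper, shown here as my own plain C++ sketch: at every stage N, the n/2 flat indices d should touch each of the n array elements exactly once, which is what makes each stage safe to run as a single Parallel.For.

```cpp
#include <set>

// Check that one FFT stage's flat-index mapping covers every array element
// exactly once (no collisions between butterflies).
bool stage_covers_all(int n, int N) {
    int step = N / 2;
    std::set<int> touched;
    for (int d = 0; d < n / 2; ++d) {
        int t = d % step + N * (d / step);
        touched.insert(t);            // evenIndex
        touched.insert(t + step);     // oddIndex
    }
    return (int)touched.size() == n;  // n distinct elements means no overlap
}
```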


Moving beyond the C++ AMP model

Up to this point, I had envisioned Campy as an API similar to C++ AMP. However, I’m now thinking beyond that API, looking for a simpler interface between CPU and GPU for C# and NET code than the C++ AMP model.

In particular, C++ AMP defines array<> and array_view<>, data types that abstract an array for the GPU. What I would really prefer is to be able to run C# code on a GPU with no markup, tags, or wrappers, and to access all relevant C# data on the GPU via the closure of the objects and code the kernel uses, including classes, structs, arrays, and basic value types. In other words, almost zero boilerplate to perform parallel computations on the GPU. However, this is easier said than done.

Take for example reduction (1). Here is a simple, clean, in-place implementation of reduction on an array of integers with binary operator '+' in a new, experimental version of Campy.

int n = Bithacks.Power2(20);
int[] data = new int[n];

Campy.Parallel.For(new Extent(n), idx => data[idx] = 1);
for (int level = 1; level <= Bithacks.Log2(n); level++)
{
    int step = Bithacks.Power2(level);
    Campy.Parallel.For(new Extent(n / step), idx =>
    {
        var i = step * idx;
        data[i] = data[i] + data[i + step / 2];
    });
}

for (int i = 0; i < data.Length; ++i)
    System.Console.WriteLine(data[i]);

In this code, the first parallel for-loop initializes data; the second performs partial sums using data and step. The variables data and step are shared between the CPU and GPU. At runtime, a delegate is created containing the function of the lambda expression passed to the Parallel.For() method. The delegate contains a target object holding each variable used in the lambda, i.e., the closure (2). Implicit in this code is a SIMT computing model with shared memory, so data structures can be shared. The final result of the sum is contained in data[0].
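The two nested loops above can be sketched as a plain C++ in-place reduction (my own serial rendering, not Campy output): at each level, elements step apart are summed in pairs, so after log2(n) levels the total ends up in element 0. Each iteration of the inner loop touches a disjoint pair, which is exactly why Campy can run it as a Parallel.For.

```cpp
#include <vector>

// In-place pairwise reduction over '+'. Assumes data.size() is a power of 2.
int reduce_in_place(std::vector<int> &data) {
    int n = (int)data.size();
    for (int step = 2; step <= n; step *= 2)       // one pass per level
        for (int i = 0; i < n; i += step)          // independent pairs:
            data[i] += data[i + step / 2];         // parallelizable
    return data[0];                                // total lands here
}
```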

This example illustrates several problems. The implementation of the C# data structures can vary across implementations of the NET runtime; the GPU code must be aware of the implementation, or it must marshal the objects into an appropriate representation for the GPU. Many operations on the C# data structures rely on the NET runtime used, so the GPU must be able to JIT the NET runtime. Complications of sharing data include alignment issues (a 64-bit fetch must be aligned on a 64-bit boundary; 3, 4, 5). C# objects are allocated using the C# virtual memory manager, which we do not have any control of (6): the memory manager allocates objects from a heap, and garbage collection of stale objects can occur at any time.
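The alignment complication mentioned above, in miniature: a marshaller copying a C# object into GPU memory must round each field's offset up to that field's natural alignment. A sketch of the usual rounding helper:

```cpp
#include <cstddef>

// Round `offset` up to the next multiple of `alignment`.
// `alignment` must be a power of two for this bit trick to work.
size_t align_up(size_t offset, size_t alignment) {
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

For example, in a structure holding a char followed by a 64-bit double, the double cannot sit at offset 1; it goes at align_up(1, 8) == 8, with padding in between.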

As a first step toward sharing data structures, let's assume we are accessing only value types: user-defined arrays and structures that contain only other value types, such as integers, characters, booleans, arrays, and structs. This limits the difficulties associated with JITing the NET runtime. Further, as we do not have access to the memory allocator/garbage collector, let us also assume the CIL contains no calls into the memory system. These restrictions are also assumed in ILGPU and Alea GPU.
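That restriction amounts to a recursive check over types. A hypothetical sketch in Python (the type descriptions and function name are invented purely for illustration; Campy's actual check works over NET metadata):

```python
# Hypothetical model of the "value types only" restriction: a type is
# acceptable if it is a scalar, an array of an acceptable element type,
# or a struct whose fields are all acceptable. Reference types
# (classes, strings, delegates) are rejected.
SCALARS = {"int", "char", "bool", "float", "double"}

def is_value_type_only(ty):
    kind = ty[0]
    if kind == "scalar":
        return ty[1] in SCALARS
    if kind == "array":
        return is_value_type_only(ty[1])                  # element type
    if kind == "struct":
        return all(is_value_type_only(f) for f in ty[1])  # field types
    return False  # class, string, delegate, pointer, ...

point = ("struct", [("scalar", "int"), ("scalar", "int")])
grid = ("array", point)
bad = ("struct", [("scalar", "int"), ("class", "System.String")])

print(is_value_type_only(grid))  # True
print(is_value_type_only(bad))   # False
```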

With these limitations, a simple solution I am working toward is to convert C# data structures to equivalent blittable types (7, 8), then deep-copy them to and from GPU memory. Unfortunately, this means a fair amount of copying to and from unmanaged memory, currently allocated via cuMemAllocManaged. Note that reference types, which are what most programmers use, are not yet handled. A simple example with Campy restricted to value types is available (9).
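A minimal Python sketch of the deep-copy idea, with a flat list standing in for the unmanaged buffer (the helper names are invented for illustration and do not exist in Campy): flatten a nested value-type structure into contiguous storage, let the "kernel" work on that, then unflatten the results back into the managed shape.

```python
# Sketch of deep copy to/from a contiguous "unmanaged" buffer, the same
# shape as marshaling blittable data into cuMemAllocManaged memory.
def flatten(obj, buf):
    """Serialize nested lists/tuples of scalars into a flat buffer."""
    if isinstance(obj, (list, tuple)):
        for item in obj:
            flatten(item, buf)
    else:
        buf.append(obj)

def unflatten(template, buf, pos=0):
    """Rebuild the nested structure of 'template' from the flat buffer."""
    if isinstance(template, (list, tuple)):
        out = []
        for item in template:
            val, pos = unflatten(item, buf, pos)
            out.append(val)
        return out, pos
    return buf[pos], pos + 1

managed = [[1, 2], [3, [4, 5]]]   # nested value types on the "CPU"
unmanaged = []                    # stands in for GPU-visible memory
flatten(managed, unmanaged)

# The "kernel" runs over the flat buffer...
unmanaged = [x * 10 for x in unmanaged]

# ...then results are deep-copied back into the managed shape.
result, _ = unflatten(managed, unmanaged)
print(result)  # [[10, 20], [30, [40, 50]]]
```

The explicit synchronization API mentioned in the status notes exists precisely so this copy-back does not have to happen after every Parallel.For() call.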

Alea GPU Version 3 provides an interface very similar to Campy's, in which kernel delegates are passed to a Parallel.For()-like routine for JITing and execution. In Version 3, the attribute "[GpuManaged]" is used to reduce the cost of synchronizing data between CPU and GPU. ILGPU Version 0.1.4 does not handle reference types (i.e., classes).

–Ken Domino

Notes and References

(1) Reduction

  • REDUCE(⊕, A): The REDUCE operation takes a binary operator ⊕ with identity I, and an ordered set A = [a0, a1, …, an−1] of n elements. The value of REDUCE(⊕, A) is (((a0 ⊕ a1) ⊕ … ) ⊕ an−1).
  • Since addition is commutative and associative for integers, REDUCE can be computed in parallel. One implementation is an in-place method, which modifies the input set: a parallel inner for-loop nested within an outer for-loop, as shown in the Campy code above. The inner for-loop operates on pairs, each of which can be processed in parallel.

(2) Lambda Expressions (C# Programming Guide). Microsoft Documentation. https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/statements-expressions-operators/lambda-expressions

(3) Coding for Performance: Data alignment and structures. https://software.intel.com/en-us/articles/coding-for-performance-data-alignment-and-structures

(4) Data Alignment when Migrating to 64-Bit Intel® Architecture. https://software.intel.com/en-us/articles/data-alignment-when-migrating-to-64-bit-intel-architecture

(5) How Memory Is Accessed. https://software.intel.com/en-us/articles/how-memory-is-accessed

(6) https://github.com/dotnet/coreclr/issues/1235

(7) Puran Mehram, P. Managed code and unmanaged code in .NET. C# Corner. http://www.c-sharpcorner.com/uploadfile/puranindia/managed-code-and-unmanaged-code-in-net/

(8) Parsons, J. Changes to Blittable Types. GitHubGist. https://gist.github.com/jaredpar/cecc2f5fd76b70e480450296d4c9914e . Accessed Aug 29, 2017.

(9) You can look at and try the working test on GitHub. Note that it does not yet run the reduction example due to alignment problems.

# git clone https://github.com/kaby76/Campy.git

# cd Campy; git checkout -b proof-of-concept


Two steps forward, one step back…

Campy is moving forward, albeit slowly. I have written new classes that model the C++ AMP design more fully, and I have run code for a simple “Hello World” example on an NVIDIA GPU.

As it turns out, the package I was using, ManagedCUDA, won’t support Net Standard until that comes out, which might be Fall 2017, or maybe never: I have checked Net Core 2.0/Net Standard Pre-release 1, and the Net Standard 2.0 package does not work on a simple example. So, I will be taking extra time to write a SWIG-generated library for the CUDA Driver API. I will likely need a similar library for AMD GPU drivers as well, but will do that after a release of Campy that works with NVIDIA GPUs.

In the meantime, I learned that there is another library similar to Campy, namely ILGPU. It looks comparable, but IMHO it is missing pieces of the C++ AMP programming model. Still, you might want to take a look at it.

–Ken Domino