Progress with a type system

After getting fed up with debugging the GPU BCL code using printf, I finally have NVIDIA Nsight working with Campy, at least partially.

One problem was that all the examples I wrote executed the program from the directory containing the executable, so Campy examples would always work, as if by magic. However, if I tried to debug a Campy program using Nsight, it would always fail on a call to the Mono Cecil method Resolve(). Nsight implements debugging using a client/server model, which is pretty much how all debuggers work. However, the server would not honor the working directory specified; instead, it would execute the assembly from some directory other than where the test program resided. As it turns out, Mono Cecil requires an assembly resolver so that Resolve() can find dependent assemblies. Adding code for a resolver finally fixed the problem of debugging Campy using Nsight.
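The fix looks roughly like the following sketch (not Campy’s actual code; the method and variable names are illustrative), using Mono Cecil’s DefaultAssemblyResolver to tell Resolve() where dependent assemblies live:

```csharp
using System.IO;
using Mono.Cecil;

class AssemblyLoader
{
    // Read a module with a resolver that searches the assembly's own
    // directory, so later calls to Resolve() can find dependent assemblies
    // even when the debugger launches from a different working directory.
    public static ModuleDefinition Load(string assemblyPath)
    {
        var resolver = new DefaultAssemblyResolver();
        resolver.AddSearchDirectory(Path.GetDirectoryName(assemblyPath));
        var parameters = new ReaderParameters { AssemblyResolver = resolver };
        return ModuleDefinition.ReadModule(assemblyPath, parameters);
    }
}
```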

A second problem was that Nsight didn’t understand the debugging information in the PTX files generated when I compile the BCL CUDA source. I partially fixed this, so I can at least set breakpoints and step through kernels, by changing the NVCC compiler options to interleave source in the PTX (--source-in-ptx), omit GPU debug information (no -G), and generate line number information (-lineinfo). The other options I use are --keep, -rdc=true, --compile, --machine 64, -cudart static, and compute_35,sm_35. I tried various options in cuModuleLoadDataEx with PTX files produced with -G, but to no avail. There could also be a problem with my CUDA C# library Swigged.CUDA, where the option values may not be passed correctly.
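Put together, the compile step looks roughly like this (the source file name is hypothetical, and the architecture flags are spelled in nvcc’s -gencode form, which may differ from the exact build script):

```shell
# --keep retains intermediate files, including the .ptx that Campy loads;
# --source-in-ptx and -lineinfo give Nsight enough to set breakpoints.
nvcc --compile --keep -rdc=true --machine 64 -cudart static \
     --source-in-ptx -lineinfo \
     -gencode arch=compute_35,code=sm_35 Basics.cu
```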

Third, CUDA programs execute with a small default runtime stack size, so allocating local buffers like char buf[2000]; blows the stack very quickly.
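One workaround, short of shrinking every buffer, is to raise the per-thread stack limit from the host before launching. A minimal sketch using the CUDA Driver API (which is what Campy talks to through Swigged.CUDA; the 16 KB figure is just an illustrative guess):

```cuda
#include <cuda.h>

// Host-side: enlarge the default per-thread stack before launching kernels
// that allocate large local buffers.
void raise_stack_limit(void)
{
    // 16 KB per thread is an illustrative value, not a recommendation;
    // larger stacks consume more device memory per resident thread.
    cuCtxSetLimit(CU_LIMIT_STACK_SIZE, 16 * 1024);
}
```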

Although the GPU BCL type system is getting closer to working, it still doesn’t. More hacking required.

–Ken Domino

Caught between a rock and a hard place

The BCL for Campy is, slowly, being integrated and tested. Unfortunately, I just found out that the NVIDIA CUDA 9 GPU Toolkit does not support Kepler GPUs, which is what my 5-year-old laptop has, even though it is still a fine machine. But the CUDA 8 GPU Toolkit does not support Visual Studio 2017. When I install the older Visual Studio 2015 and try to build the BCL, I find that it cannot compile functions that use an ellipsis:

__device__ void Crash(char *pMsg, ...) {}

error : a __device__ or __host__ __device__ function cannot have ellipsis

Removing the ellipsis syntax from the BCL source would require a lot of changes, which in the long run do nothing for the BCL except make it less readable. It is likely fewer people will be using Kepler cards (e.g., K80), as people are moving on to Pascal GPUs (e.g., P100). Therefore, Campy is going to require Maxwell or newer GPUs and Visual Studio 2017.

Second, it looks like the code for the BCL type system isn’t working. To get instructions like ldstr and newobj working, a functioning reflection type system is needed on the GPU. DotNetAnywhere has 300 lines of C code to read an assembly (which I encode at the moment as a large byte array, avoiding fopen) and extract its metadata. Unfortunately, after what seemed like endless cycles of debugging the CUDA code using printf, I ran the code on the CPU and found that it doesn’t work because it is designed for a 32-bit address target, whereas Campy targets 64-bit programming. Not being an expert on the PE file format, I’ll need to take some time to fix this code. So much for free software.

A NET GPU Base Class Library

In my last post, I mentioned that Campy was able to JIT quite a bit, but failed to JIT many kernels because a NET base class library for the GPU was required. I’m happy to say this is now corrected. The DotNetAnywhere runtime has been ported to the NVIDIA GPU, and the simple string search example I mentioned in the last blog post now works.

The GPU BCL consists of 13 thousand lines of CUDA code (in 41 .CU files, which NVCC compiles into 56 thousand lines of PTX that is loaded when executing a kernel) and 24 thousand lines of C# code. When a kernel is run, Campy rewrites the kernel to use the GPU BCL. I still haven’t gotten over first seeing the string search kernel compile and run with all this baggage (uh, runtime!).

While an important step, there is much work to do, not the least of which is a parallel memory allocator/garbage collector that will work on the GPU.

Ken

I’ve a feeling we’re not in Kansas anymore

Now that Campy can compile simple value and reference types, the next issue is the runtime. Even on simple examples, when Campy decompiles a kernel in order to JIT it for the GPU, it quickly discovers that the kernel references the NET runtime, as most C# code does.

In all NET runtimes, most of the source is written in C#, which can be JIT’ed by Campy, but a small portion of the Base Class Library (see ECMA-334, Appendix D) is written in C, and therefore cannot be JIT’ed by Campy. Furthermore, depending on how a program that uses Campy is run, Campy decompiles different runtimes for the same kernel source code.

Let’s take a simple string search example, which searches for all occurrences of a pattern in a text string using a brute-force algorithm.

string text = ...;
string pattern = ...;
int[] found = ...;
Campy.Parallel.For(n, i =>
{
    int j;
    for (j = 0; j < m && i + j < n; j++)
        if (text[i + j] != pattern[j])
            break; // mismatch found; exit the inner loop
    if (j == m)
    {
        // match found
        found[i] = 1;
    }
});

When built with NET Framework 4.7, Campy discovers the kernel method and the BCL method String.get_Chars(). However, get_Chars() has no CIL body, as it is a native C function; reflection confirms this, tagging the method as an internal call.
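You can see this tag yourself with a few lines of reflection (a standalone sketch; on runtimes other than NET Framework 4.7 the flag may differ, which is exactly the point of the next paragraph):

```csharp
using System;
using System.Reflection;

class Probe
{
    static void Main()
    {
        // get_Chars is the getter behind the string indexer, i.e. text[i].
        MethodInfo m = typeof(string).GetMethod("get_Chars");
        MethodImplAttributes flags = m.GetMethodImplementationFlags();
        bool isInternalCall = (flags & MethodImplAttributes.InternalCall) != 0;
        Console.WriteLine("get_Chars InternalCall: " + isInternalCall);
    }
}
```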

If the program is run under Mono, Campy discovers a different implementation, all being in CIL, which can be JIT’ed.

The result is a richly textured landscape of different runtimes that need to be accommodated. That said, much of the runtime just doesn’t make any sense on the GPU: networking, graphics drawing, etc.

After reading some of the source code for CoreCLR, CoreRT, Mono, the NET Micro Framework, and DotNetAnywhere, I’ve come to the conclusion that I cannot simply replace the calls to C functions of the runtime with unsafe C# code, which is what I have been doing so far. This cannot work because the C# and C sides of the BCL must agree on the layout of the data structures they share.

Therefore, Campy must implement a BCL specific to it. Fortunately, there are plenty of examples to start from. Unfortunately, while CUDA C++ on Windows can compile C++11/14, it requires all device-callable functions to be tagged with the __device__ qualifier, a rather annoying requirement.


Advancing with value and reference types, and onto a runtime

There has been significant progress on Campy. It is now starting to compile both value types and reference types, and small bits of the NET runtime. For value types, it compiles the usual ints, floats, doubles, etc., but structs as well. For methods, it compiles static and non-virtual functions. Support for the NET runtime is still next to nil because I cannot get Mono Cecil to decompile the bytecode of the NET runtime.

It hasn’t been easy getting to this stage. Despite years in compiler development, and many more years as a developer, I haven’t kept up with the latest tech. Most compiler writers now use LLVM, which I wasn’t at all familiar with, since I tend to write my own code for everything. The sparsity of examples showing how to use LLVM-C didn’t help. Fortunately, I am coming up to speed, and when I get some time to breathe, I plan on writing some very basic straight-line code that will function as both examples and unit tests for LLVM-C. It’s taken two months to write a thunking layer to get C# to talk to LLVM-C, a month to write a thunking layer to get C# to talk to the CUDA Driver API, and now three months of 12+ hours per day, 7 days per week, to get the compiler to translate CIL into NVIDIA PTX, copy data structures between C# and an internal representation, and run the GPU code.

Two important examples now work: reduction over integer addition and the Fast Fourier Transform. The latter uses System.Numerics.Complex, which is part of the NET runtime. Complex is a struct (i.e., a value type). This example cannot be compiled by any of the other C#/GPU compilers out there (Alea GPU, ILGPU). Note, the combination of the time for JIT compilation and the deep data structure copying to/from pinned global memory makes for a very slow implementation of algorithms on the GPU. But I have plans to fix this.

Speaking of which, it turns out there is quite a bit of the NET runtime that must be rewritten as a drop-in replacement for certain types one uses. For example, you can index a string to get at its characters, but the indexer calls a NET runtime method that is implemented natively. There are basically two runtimes from which to choose: Mono or NET Core. At this point, I am not quite sure which will be best.

I’ve been making semi-regular releases of Campy for NET Framework 4.7 programs on Windows systems with NVIDIA GPUs. It still has a long way to go, but it’s very encouraging to see some rather complex examples working.

Given the progress in Campy, I’m planning on going to NVIDIA’s GPU Technology Conference in March 2018. In fact, I will be submitting a proposal for a presentation on Campy for the conference. I hope my proposal is accepted, and I have a chance to meet other C#/GPU developers at the conference.

Here is the code for reduction and the FFT for you to get an idea of what things are looking like.

Reduction

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using Campy;

namespace Reduction
{
    public class Bithacks
    {
        static bool preped;

        static int[] LogTable256 = new int[256];

        static void prep()
        {
            LogTable256[0] = LogTable256[1] = 0;
            for (int i = 2; i < 256; i++)
            {
                LogTable256[i] = 1 + LogTable256[i / 2];
            }
            LogTable256[0] = -1; // if you want log(0) to return -1

            // Prepare the reverse bits table.
            prep_reverse_bits();
        }

        public static int FloorLog2(uint v)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            int r; // r will be lg(v)
            uint tt; // temporaries

            if ((tt = v >> 24) != 0)
            {
                r = (24 + LogTable256[tt]);
            }
            else if ((tt = v >> 16) != 0)
            {
                r = (16 + LogTable256[tt]);
            }
            else if ((tt = v >> 8) != 0)
            {
                r = (8 + LogTable256[tt]);
            }
            else
            {
                r = LogTable256[v];
            }
            return r;
        }

        public static long FloorLog2(ulong v)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            long r; // r will be lg(v)
            ulong tt; // temporaries

            if ((tt = v >> 56) != 0)
            {
                r = (56 + LogTable256[tt]);
            }
            else if ((tt = v >> 48) != 0)
            {
                r = (48 + LogTable256[tt]);
            }
            else if ((tt = v >> 40) != 0)
            {
                r = (40 + LogTable256[tt]);
            }
            else if ((tt = v >> 32) != 0)
            {
                r = (32 + LogTable256[tt]);
            }
            else if ((tt = v >> 24) != 0)
            {
                r = (24 + LogTable256[tt]);
            }
            else if ((tt = v >> 16) != 0)
            {
                r = (16 + LogTable256[tt]);
            }
            else if ((tt = v >> 8) != 0)
            {
                r = (8 + LogTable256[tt]);
            }
            else
            {
                r = LogTable256[v];
            }
            return r;
        }

        public static int CeilingLog2(uint v)
        {
            int r = Bithacks.FloorLog2(v);
            if (r < 0)
                return r;
            if (v != (uint)Bithacks.Power2((uint)r))
                return r + 1;
            else
                return r;
        }

        public static int Power2(uint v)
        {
            if (v == 0)
                return 1;
            else
                return (int)(2 << (int)(v - 1));
        }

        public static int Power2(int v)
        {
            if (v == 0)
                return 1;
            else
                return (int)(2 << (int)(v - 1));
        }

        static byte[] BitReverseTable256 = new byte[256];

        static void R2(ref int i, byte v)
        {
            BitReverseTable256[i++] = v;
            BitReverseTable256[i++] = (byte)(v + 2 * 64);
            BitReverseTable256[i++] = (byte)(v + 1 * 64);
            BitReverseTable256[i++] = (byte)(v + 3 * 64);
        }

        static void R4(ref int i, byte v)
        {
            R2(ref i, v);
            R2(ref i, (byte)(v + 2 * 16));
            R2(ref i, (byte)(v + 1 * 16));
            R2(ref i, (byte)(v + 3 * 16));
        }

        static void R6(ref int i, byte v)
        {
            R4(ref i, v);
            R4(ref i, (byte)(v + 2 * 4));
            R4(ref i, (byte)(v + 1 * 4));
            R4(ref i, (byte)(v + 3 * 4));
        }

        static void prep_reverse_bits()
        {
            int i = 0;
            R6(ref i, 0);
            R6(ref i, 2);
            R6(ref i, 1);
            R6(ref i, 3);
        }

        public static byte ReverseBits(byte from)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            return BitReverseTable256[from];
        }

        public static Int32 ReverseBits(Int32 from)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            Int32 result = 0;
            for (int i = 0; i < sizeof(Int32); ++i)
            {
                result = result << 8;
                result |= BitReverseTable256[(byte)(from & 0xff)];
                from = from >> 8;
            }
            return result;
        }

        public static UInt32 ReverseBits(UInt32 from)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            UInt32 result = 0;
            for (int i = 0; i < sizeof(UInt32); ++i)
            {
                result = result << 8;
                result |= BitReverseTable256[(byte)(from & 0xff)];
                from = from >> 8;
            }
            return result;
        }

        static int Ones(uint x)
        {
            // 32-bit recursive reduction using SWAR...  but first step is mapping 2-bit values
            // into sum of 2 1-bit values in sneaky way
            x -= ((x >> 1) & 0x55555555);
            x = (((x >> 2) & 0x33333333) + (x & 0x33333333));
            x = (((x >> 4) + x) & 0x0f0f0f0f);
            x += (x >> 8);
            x += (x >> 16);
            return (int)(x & 0x0000003f);
        }

        public static int xFloorLog2(uint x)
        {
            x |= (x >> 1);
            x |= (x >> 2);
            x |= (x >> 4);
            x |= (x >> 8);
            x |= (x >> 16);
            return (Bithacks.Ones(x) - 1);
        }

        public static int Log2(uint x)
        {
            return FloorLog2(x);
        }

        public static int Log2(int x)
        {
            return FloorLog2((uint)x);
        }


    }

    [TestClass]
    public class Reduction
    {
        [TestMethod]
        public void ReductionT()
        {
            int n = Bithacks.Power2(10);
            int result_gpu = 0;
            int result_cpu = 0;
            {
                int[] data = new int[n];
                Campy.Parallel.For(n, idx => data[idx] = 1);
                for (int level = 1; level <= Bithacks.Log2(n); level++)
                {
                    int step = Bithacks.Power2(level);
                    Campy.Parallel.For(new Extent(n / step), idx =>
                    {
                        var i = step * idx;
                        data[i] = data[i] + data[i + step / 2];
                    });
                }
                result_gpu = data[0];
            }
            {
                int[] data = new int[n];
                for (int idx = 0; idx < n; ++idx) data[idx] = 1;
                for (int level = 1; level <= Bithacks.Log2(n); level++)
                {
                    int step = Bithacks.Power2(level);
                    for (int idx = 0; idx < n / step; idx++)
                    {
                        var i = step * idx;
                        data[i] = data[i] + data[i + step / 2];
                    }
                }
                result_cpu = data[0];
            }
            if (result_gpu != result_cpu) throw new Exception();
        }
    }
}

FFT

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using System.Linq;
using System.Numerics;

namespace FFT
{
    [TestClass]
    public class UnitTest1
    {
        /* Performs a bit reversal algorithm on a positive integer 
         * for given number of bits
         * e.g. 011 with 3 bits is reversed to 110
         */
        public static int BitReverse(int n, int bits)
        {
            int reversedN = n;
            int count = bits - 1;

            n >>= 1;
            while (n > 0)
            {
                reversedN = (reversedN << 1) | (n & 1);
                count--;
                n >>= 1;
            }

            return ((reversedN << count) & ((1 << bits) - 1));
        }

        /* Uses the Cooley-Tukey iterative in-place algorithm, radix-2 DIT case;
         * assumes the number of points provided is a power of 2 */
        public static void FFT(Complex[] buffer)
        {

            int bits = (int)Math.Log(buffer.Length, 2);
            for (int j = 1; j < buffer.Length / 2; j++)
            {
                int swapPos = BitReverse(j, bits);
                var temp = buffer[j];
                buffer[j] = buffer[swapPos];
                buffer[swapPos] = temp;
            }

            for (int N = 2; N <= buffer.Length; N <<= 1)
            {
                for (int i = 0; i < buffer.Length; i += N)
                {
                    for (int k = 0; k < N / 2; k++)
                    {
                        int evenIndex = i + k;
                        int oddIndex = i + k + (N / 2);
                        var even = buffer[evenIndex];
                        var odd = buffer[oddIndex];

                        double term = -2 * Math.PI * k / (double)N;
                        Complex exp = new Complex(Math.Cos(term), Math.Sin(term)) * odd;

                        buffer[evenIndex] = even + exp;
                        buffer[oddIndex] = even - exp;
                    }
                }
            }
        }

        public static void FFTGPU(Complex[] buffer)
        {
            int bits = (int)Math.Log(buffer.Length, 2);

            Campy.Parallel.For(buffer.Length / 2 - 1, k =>
            {
                int j = k + 1;
                int swapPos = BitReverse(j, bits);
                var temp = buffer[j];
                buffer[j] = buffer[swapPos];
                buffer[swapPos] = temp;
            });

            for (int N = 2; N <= buffer.Length; N <<= 1)
            {
                int step = N / 2;
                int bstep = N;
                Campy.Parallel.For(buffer.Length / 2, d =>
                {
                    var k = d % step;
                    var i = N * (d / step);
                    var t = d % step + N * (d / step);
                    int evenIndex = t;
                    int oddIndex = t + step;

                    var even = buffer[evenIndex];
                    var odd = buffer[oddIndex];

                    double term = -2 * Math.PI * k / (double)N;
                    Complex exp = new Complex(Math.Cos(term), Math.Sin(term)) * odd;

                    buffer[evenIndex] = even + exp;
                    buffer[oddIndex] = even - exp;
                });
            }
        }

        bool ApproxEqual(double a, double b)
        {
            if (b > a)
                return (b - a) < 0.01;
            else
                return (a - b) < 0.01;
        }

        [TestMethod]
        public void TestMethod1()
        {
            Complex[] input = { 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };
            var copy = input.ToArray();

            FFTGPU(input);
            FFT(copy);

            for (int i = 0; i < input.Length; ++i)
            {
                if (!ApproxEqual(copy[i].Real, input[i].Real)) throw new Exception();
                if (!ApproxEqual(copy[i].Imaginary, input[i].Imaginary)) throw new Exception();
            }
        }
    }
}


Moving beyond the C++ AMP model

Up to this point, I had envisioned Campy as an API similar to C++/AMP. However, I’m now thinking beyond that API, looking for a simpler interface between CPU and GPU for C# and NET code than the C++/AMP model.

In particular, C++/AMP defines array<> and array_view<>, data types that abstract an array for the GPU. What I would really prefer is to be able to run C# code on a GPU with no markup/tags/wrappers/etc., and to access all relevant C# data via the closure of all objects and code the kernel uses, including classes, structs, arrays, and basic value types. In other words, almost zero boilerplate code to perform parallel computations on the GPU. However, this is easier said than done.

Take for example reduction (1). Here is a simple, clean, in-place implementation of reduction on an array of integers with binary operator '+' in a new, experimental version of Campy.

int n = Bithacks.Power2(20);
int[] data = new int[n];
Extent e = new Extent(n);

Campy.Parallel.For(new Extent(n), idx => data[idx] = 1);
for (int level = 1; level <= Bithacks.Log2(n); level++)
{
    int step = Bithacks.Power2(level);
    Campy.Parallel.For(new Extent(n / step), idx =>
    {
        var i = step * idx;
        data[i] = data[i] + data[i + step / 2];
    });
}

for (int i = 0; i < data.Length; ++i)
    System.Console.WriteLine(data[i]);

In this code, the first parallel for-loop initializes data; the second parallel for-loop performs a sum using data and step. The variables data and step are shared between the CPU and GPU. At runtime, a delegate is created containing the function of the lambda expression passed to the Parallel.For() method. The delegate contains a target object for each variable used in the lambda, i.e., the closure (2). Implicit in this code is a SIMT computing model with shared memory, so data structures can be shared. The final result of the sum is contained in data[0].

This example illustrates several problems. The implementation of the C# data structures can vary with different implementations of the NET runtime; the GPU code must be aware of this implementation, or it must marshal the objects into an appropriate representation for the GPU. Many operations on the C# data structures rely on the NET runtime used, so the GPU must be able to JIT the NET runtime. Complications of sharing data include alignment issues (a 64-bit fetch must be aligned on a 64-bit boundary; 3, 4, 5). C# objects are allocated by the C# virtual memory manager, which we do not have any control over (6); the memory manager allocates objects from a heap and may garbage-collect stale objects at any time.
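The alignment point can be seen from the CPU side alone. In this sketch (the struct is hypothetical), a double following a single byte is pushed out to offset 8 by natural alignment, so any GPU-side C struct sharing this data must agree on that padding:

```csharp
using System;
using System.Runtime.InteropServices;

struct Mixed
{
    public byte B;   // offset 0
    public double D; // natural alignment pads this out to offset 8, not 1
}

class AlignmentDemo
{
    static void Main()
    {
        // Marshal.OffsetOf reports field offsets in the marshaled layout.
        Console.WriteLine(Marshal.OffsetOf(typeof(Mixed), "D")); // prints 8
    }
}
```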

As a first step toward sharing data structures, let’s assume we are accessing only value types, that is, user-defined arrays and structs that contain only other value types: integers, characters, booleans, arrays, and structs. This limits the difficulties associated with JITing the NET runtime. Further, as we do not have access to the memory allocator/garbage collector, let us also assume the CIL does not contain calls to the memory system. These restrictions are assumed in ILGPU and Alea GPU.

With these limitations, a simple solution which I am working toward is to convert C# data structures to equivalent blittable types (7, 8), then deep copy them to and from GPU memory. Unfortunately, this means a fair bit of copying to and from unmanaged memory, currently allocated via cuMemAllocManaged. Remember, reference types, which are what most programmers use, are not handled. A simple example with Campy restricted to value types is available (9).
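A minimal sketch of the pinning half of that scheme (illustrative names only; Campy’s real implementation also rewrites references and copies into cuMemAllocManaged buffers):

```csharp
using System;
using System.Runtime.InteropServices;

// A blittable struct: only value-type fields, so its managed and
// unmanaged layouts are identical and it can be copied byte-for-byte.
struct Point
{
    public int X;
    public int Y;
}

class BlitDemo
{
    static void Main()
    {
        Point[] data = { new Point { X = 1, Y = 2 }, new Point { X = 3, Y = 4 } };

        // Pin the array so the garbage collector cannot move it, then take
        // its raw address; this address/length pair is what would be handed
        // to the CUDA driver for the deep copy.
        GCHandle handle = GCHandle.Alloc(data, GCHandleType.Pinned);
        try
        {
            IntPtr ptr = handle.AddrOfPinnedObject();
            int byteCount = Marshal.SizeOf(typeof(Point)) * data.Length;
            Console.WriteLine(byteCount); // prints 16
        }
        finally
        {
            handle.Free(); // always unpin so the GC can compact again
        }
    }
}
```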

Alea GPU Version 3 provides a very similar interface to Campy, where kernel delegates are passed to a Parallel.For()-like routine for JITing and running. In Version 3, the attribute “[GpuManaged]” is used to alleviate the issues in synchronizing data between CPU and GPU. ILGPU Version 0.1.4 does not handle reference types (i.e., classes).

–Ken Domino

Notes and References

(1) Reduction

  • REDUCE(⊕, A): The REDUCE operation takes a binary operator ⊕ with identity I, and an ordered set A = [a0, a1, …, an−1] of n elements. The value of REDUCE(⊕, A) is (((a0 ⊕ a1) ⊕ … ) ⊕ an−1).
  • Since addition is commutative and associative for integers, REDUCE can be computed in a parallel manner. One implementation is an in-place method, which modifies the input set. It consists of a parallel for-loop nested within an outer for-loop, as shown in the Campy code described above; the inner for-loop performs the pairwise additions in parallel.

(2) Lambda Expressions (C# Programming Guide). Microsoft Documentation. https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/statements-expressions-operators/lambda-expressions

(3) Coding for Performance: Data alignment and structures. https://software.intel.com/en-us/articles/coding-for-performance-data-alignment-and-structures

(4) Data Alignment when Migrating to 64-Bit Intel® Architecture. https://software.intel.com/en-us/articles/data-alignment-when-migrating-to-64-bit-intel-architecture

(5) How Memory Is Accessed. https://software.intel.com/en-us/articles/how-memory-is-accessed

(6) https://github.com/dotnet/coreclr/issues/1235

(7) Puran Mehram, P. Managed code and unmanaged code in .NET. C# Corner. http://www.c-sharpcorner.com/uploadfile/puranindia/managed-code-and-unmanaged-code-in-net/

(8) Parsons, J. Changes to Blittable Types. GitHubGist. https://gist.github.com/jaredpar/cecc2f5fd76b70e480450296d4c9914e . Accessed Aug 29, 2017.

(9) You can look at and try the working test on GitHub. Note, it does not yet run the Reduction example due to alignment problems.

# git clone https://github.com/kaby76/Campy.git

# cd Campy; git checkout -b proof-of-concept


Two steps forward, one step back…

Campy is moving forward, albeit slowly. I have written new classes that model the C++ AMP API more fully, and have run code for a simple “Hello World” example on an NVIDIA GPU.

As it turns out, the package I was using, ManagedCUDA, won’t support NET Standard until that comes out, which might be Fall 2017, but maybe never: I have checked NET Core 2.0/NET Standard Pre-release 1, and the NET Standard 2.0 package does not work on a simple example. So, I will be taking extra time to write a SWIG-generated library for the CUDA Driver API. It’s likely I will need a similar AMD GPU driver library as well, but I will do that after a release of Campy that works with NVIDIA GPUs.

In the meantime, I learned that there is another library available that is similar to Campy, namely ILGPU. It looks similar, but IMHO it is missing pieces of the C++ AMP programming model. Still, you might want to take a look at it.

–Ken Domino

Inching towards an LLVM backend

Several months ago, I started working again on Campy. Since then, I’ve had to rewrite quite a bit of the source in order to use LLVM for code generation. Campy can now compile and execute a good subset of CIL for an x86/x64 target, a small but important step toward the goal of executing C# on a GPU. A significant part of the work was simply exposing a C# wrapper library for LLVM. That wrapper library targets Windows, Ubuntu, and Android, which as far as I can tell surpasses anything offered on NuGet. The library is Swigged.llvm, which you can find here.

The next few steps are to refine the compiler further and to target an NVIDIA GPU. In addition, I will need to write a layer exposing LLVM’s SSA analysis, in order to determine what functions/methods must be compiled for the GPU. There is much work to be done, but it is worth the effort.

CampyNET reboot…

Hi Folks,

After a long break, I’ve restarted development of CampyNET! I have been porting the code over to Visual Studio 2015. And I will also port it to Visual Studio 2017 when VS 2017 is released next month!

First of all, all the code now lives on Github.com: http://github.com/kaby76/campynet. I decided to move it to Github because that is the most popular Git website.

Second, there are a number of problems with the code under VS 2015. I’ve fixed some, but there are more to correct before I make a new release of the API. While making the port, I realized that much of the code in CampyNET for SSA was broken; I’ve fixed much of this. When you run CampyNET, there is some code that scans your environment for the C++ compiler and runtime libraries; that code is extremely fragile. CampyNET also requires Uncrustify (http://uncrustify.sourceforge.net/) to format the output C++ AMP kernel code; it really shouldn’t be a requirement.

Third, the distribution of CampyNET must be fixed. Up to now, CampyNET was distributed as a downloadable EXE file which you would run on your system. This is ridiculous! Instead, the package should be distributed via NuGet.

These problems need to be fixed before I make another release of the API sometime in the next month.

If you have questions, please send them directly to ken.domino <AT-SIGN> gmail.com.

Ken Domino

CLI-based APIs for GPGPU Computing

How does Campy compare to other CLI-based (e.g., C#, F#) APIs for GPGPU computing? The following is a partial list of what is currently available, along with an example for each that computes the even natural numbers.

Cudafy

Cudafy is an API targeting CUDA and OpenCL. Programs and kernels are written in C# or other CIL languages. Cudafy scans the CIL, converts GPU code (labeled with the “[Cudafy]” attribute) into CUDA C++ or OpenCL, compiles the code into PTX, then runs it using CUDA.NET. Cudafy also uses Cloo (see below for more information on Cloo).

Note: This API is very good and maintained fairly often. It is easy to use and well documented.

Author: Nicholas Kopp; Download: http://cudafy.codeplex.com/ and http://www.hybriddsp.com/Products/CUDAfyNET.aspx; Last updated Jan 2015; Started ~Jun 2011; Open source, GNU Library General Public License (LGPL).

Example:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

namespace CudafyExample
{
    class Program
    {
        static void Main(string[] args)
        {
            CudafyModes.Target = eGPUType.Cuda;
            CudafyModes.DeviceId = 0;
            CudafyTranslator.Language = CudafyModes.Target == eGPUType.OpenCL ? eLanguage.OpenCL : eLanguage.Cuda;
            GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, 0);
            eArchitecture arch = gpu.GetArchitecture();
            CudafyModule km = CudafyTranslator.Cudafy(arch);
            gpu.LoadModule(km);
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            int[] dev_a = gpu.Allocate<int>(a);
            for (int i = 0; i < N; i++)
                a[i] = i;
            gpu.CopyToDevice(a, dev_a);
            gpu.Launch(grid_size, block_size, fun, dev_a, N);
            gpu.CopyFromDevice(dev_a, a);
            gpu.Free(dev_a);
        }

        [Cudafy]
        public static void fun(GThread thread, int[] a, int N)
        {
            int tid = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
            if (tid < N)
            {
                a[tid] = a[tid] * 2;
            }
        }
    }
}

ManagedCUDA

ManagedCUDA is an API for the CUDA Driver API. The user writes kernels in CUDA C++, then compiles them into PTX via the CUDA compiler. ManagedCUDA executes the PTX kernels. Copying data to/from the GPU is achieved via an assignment, which is a nice simplification in syntax.

Note: This API is also well maintained and easy to use. The main issue is that kernels must be written in CUDA C++, then compiled into PTX. If you’re looking for a way to write kernels in C#, this solution isn’t for you. But it is an excellent, straightforward API to the CUDA Driver API.

Author: kunzmi (Michael Kunz); Download: http://managedcuda.codeplex.com/; Last updated Feb 2015; Started ~ Jan 2011; Open source, GNU Library General Public License (LGPL).

Example:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using ManagedCuda;
using ManagedCuda.BasicTypes;
using ManagedCuda.VectorTypes;
using System.IO;
using System.Reflection;

namespace ManagedCudaExample
{
    class Program
    {
        static void Main(string[] args)
        {
            CudaContext ctx = new CudaContext(CudaContext.GetMaxGflopsDeviceId());
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            CudaDeviceVariable<int> dev_a = new CudaDeviceVariable<int>(N);
            for (int i = 0; i < N; i++)
                a[i] = i;
            dev_a = a;
            string resName = "kernel.ptx";
            string resNamespace = "ManagedCudaExample";
            string resource = resNamespace + "." + resName;
            Stream stream = Assembly.GetExecutingAssembly().GetManifestResourceStream(resource);
            if (stream == null) throw new ArgumentException("Kernel not found in resources.");
            CudaKernel kernel = ctx.LoadKernelPTX(stream, "kernel");
            kernel.BlockDimensions = block_size;
            kernel.GridDimensions = grid_size;
            kernel.Run(dev_a.DevicePointer, N);
            a = dev_a;
        }
    }
}

ManagedCUDA requires kernels built as PTX. The following is the kernel in CUDA C++, which can be compiled into PTX via the CUDA compiler with the --keep option.

// Includes for IntelliSense
#include <cuda.h>

extern "C" {
    // kernel code
    __global__ void kernel(int * a, int N)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < N)
            a[i] = a[i] * 2;
    }
}

CUDA.NET

CUDA.NET is an API for the CUDA Driver API. Users write kernels in CUDA C++, compile them into PTX, then execute the PTX with CUDA.NET.

Download: http://www.cass-hpc.com/solutions/libraries/cuda-net; Last updated 2010; Started ~Apr 2008.

Notes: Supposedly, the code for CUDA.NET is open source and has no license requirements. However, the archive page does not have a download link for the source, just the binaries. The latest build is for CUDA 3.0, which is very old (currently, NVIDIA is on CUDA 7.0). The only source I could find is in the Cudafy source tree (Cudafy/3p/cuda.net3.0.0_win/src/…), which has been updated for Cudafy and CUDA 6.5. The documentation in the latest download, v3.0, is unreadable. The only other documentation I could find is for v2.0, http://www.hoopoe-cloud.com/files/cuda.net/2.0/CUDA.NET_2.0.pdf, but it contains numerous inaccuracies.

Example:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using GASS.CUDA;
using GASS.CUDA.Types;
using GASS.Types;

namespace CudaNetExample
{
    class Program
    {
        static void Main(string[] args)
        {
            CUResult i0 = CUDADriver.cuInit(0);
            CUdevice dev = new CUdevice();
            CUResult i1 = CUDADriver.cuDeviceGet(ref dev, 0);
            CUcontext ctx = new CUcontext();
            CUResult i2 = CUDADriver.cuCtxCreate(ref ctx, 0, dev);

            CUdeviceptr p1 = new CUdeviceptr();
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            for (int i = 0; i < N; i++)
                a[i] = i;
            CUResult s1 = CUDADriver.cuMemAlloc(ref p1, N * sizeof(int));
            CUResult s2 = CUDADriver.cuMemcpyHtoD(p1, a, a.Length * sizeof(int));
            CUmodule mod = new CUmodule();
            CUResult s3 = CUDADriver.cuModuleLoad(ref mod, @"C:\Users\Ken\Documents\Visual Studio 2013\Projects\ManagedCudaExample\Win32Project1\Debug\kernel.ptx");
            CUfunction func = new CUfunction();
            CUResult s4 = CUDADriver.cuModuleGetFunction(ref func, mod, "kernel");

            CUResult r1 = CUDADriver.cuParamSeti(func, 0, (uint)p1.Pointer);
            CUResult r2 = CUDADriver.cuParamSeti(func, 4, (uint)N);
            CUResult r3 = CUDADriver.cuParamSetSize(func, 8);
            CUResult r4 = CUDADriver.cuFuncSetBlockShape(func, block_size, 1, 1);
            CUResult r5 = CUDADriver.cuLaunchGrid(func, grid_size, 1);
            CUResult r6 = CUDADriver.cuCtxSynchronize();

            CUResult f1 = CUDADriver.cuMemcpyDtoH(a, p1, a.Length * sizeof(int));
 
        }
    }
}

Alea GPU

Alea GPU is an API similar to Cudafy. Kernels, annotated with the “[AOTCompile]” attribute, are compiled for the GPU via LLVM.

Author: Daniel Egloff; Download: http://quantalea.com/home and http://quantalea.com/static/app/tutorial/quick_start/quick_start_example.html; Last updated Mar 2015. Proprietary.

Example:

using System;
using Alea.CUDA;
using Alea.CUDA.Utilities;
using Alea.CUDA.IL;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace AleaExample
{
    class Program
    {
        [AOTCompile]
        static void kernel(deviceptr<int> a, int n)
        {
            var i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)
                a[i] = a[i] * 2;
        }

        static void Main(string[] args)
        {
            var inputs = Enumerable.Range(0, 1000000).Select(i => i).ToArray();
            var worker = Worker.Default;
            DeviceMemory<int> dInputs = worker.Malloc(inputs);
            const int blockSize = 256;
            var numSm = worker.Device.Attributes.MULTIPROCESSOR_COUNT;
            var gridSize = Math.Min(16 * numSm, Common.divup(inputs.Length, blockSize));
            var lp = new LaunchParam(gridSize, blockSize);
            worker.Launch<deviceptr<int>, int>(kernel, lp, dInputs.Ptr, inputs.Length);
        }
    }
}

Note: I could not get the example to compile and run. The Alea GPU compiler requires a license, which I could not get because the email confirmation link produced an error on the company’s server.

CudaSharp

In CudaSharp, users write kernels in C#. The GPU code is converted into PTX via LLVM, then executed via the ManagedCUDA API. Kernels can be lambdas.

Author: khyperia (Evan Huack); Download: https://github.com/khyperia/CudaSharp; Last updated Jan 2014; Started Jan 2014; Open source, no license.

Note: This project is incomplete. The source code for the API does not build. Further, while in principle the code translates MS CIL, a stack-based assembly language, into LLVM, an SSA-based representation, it may not be correct because there is no code that merges the evaluation stacks of multiple predecessor blocks at join points.

A similar project was written for a Master's Thesis by Soren Horup, Soren Juul, and Herik Larsen at Aalborg University in June 2011. The thesis is available at http://projekter.aau.dk/projekter/files/52819053/Report.pdf. However, I could not find the source code for the project.

GPU.NET

The UK-based company TidePowerd.com, started by Nicolas Beecroft and Jack Pappas, produced a CIL-based backend targeting CUDA GPUs. The website is no longer available, so the details aren’t clear. But there are examples in https://github.com/tidepowerd/GPU.NET-Example-Projects, which indicate kernel markup with “[kernel]” attributes and a high-level model of the GPU based upon CUDA C++.

GpuLinq

Written in F#, this LINQ-like API contains a compiler that runs queries on the GPU: https://github.com/nessos/GpuLinq/. The current status of the project is unknown.

Brahma

A project of the Software Engineering Research Group at Saint-Petersburg State University of the Russian Federation, the code was supposedly an API that targeted OpenCL. However, it is no longer available. There is a similar project, Brahma.FSharp, which seems to be active (last updated 2014), but the details are unclear. https://sites.google.com/site/semathsrprojects/home

Microsoft Accelerator

This project offered a LINQ-like API containing a compiler that runs queries on the GPU. The status of the project is unclear. http://research.microsoft.com/en-us/projects/Accelerator/

OpenCL.NET

OpenCL is an API for multicore computing, whether on a CPU or GPU; the API is platform and device neutral. OpenCL.NET is a thin wrapper on top of the OpenCL.dll assembly. OpenCL requires one to write the kernels in C. These are compiled, linked, loaded, and run by the OpenCL drivers. Note: OpenCL.NET has not been updated since Sep 2013. https://openclnet.codeplex.com/

Example:

using System;
using System.Diagnostics;
using System.Linq;
using OpenCL.Net.Extensions;
using OpenCL.Net;
using System.IO;
using System.Collections.Generic;

namespace Opencl_Device_Query
{
    class Program
    {
        static void Main(string[] args)
        {
            List<String> source_list = new List<string>();            
            try
            {
                using (StreamReader sr = new StreamReader("kernel.cl"))
                {
                    String line = sr.ReadToEnd();
                    source_list.Add(line);
                }
            }
            catch (Exception e)
            {
                Console.WriteLine("The file could not be read:");
                Console.WriteLine(e.Message);
                System.Environment.Exit(1);
            }
            String[] source = source_list.ToArray();
            IntPtr[] lengths = new IntPtr[source.Length];
            for (int i = 0; i < source.Length; ++i) lengths[i] = (IntPtr)source[i].Length;
            ErrorCode e1;
            Context context = Cl.CreateContext("NVIDIA CUDA", DeviceType.Gpu, out e1);
            OpenCL.Net.Program program = Cl.CreateProgramWithSource(context, 1, source, lengths, out e1);
            InfoBuffer ib = Cl.GetContextInfo(context, ContextInfo.Devices, out e1);
            Device[] devices = ib.CastToArray<Device>(ib.Size);
            e1 = Cl.BuildProgram(program, 1, devices, "", null, (IntPtr)0);
            if (e1 != ErrorCode.Success)
            {
                ib = Cl.GetProgramBuildInfo(program, devices[0], ProgramBuildInfo.Log, out e1);
                char[] log = ib.CastToArray<char>(ib.Size);
                System.Console.WriteLine(new String(log));
                System.Environment.Exit(1);
            }
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            for (int i = 0; i < N; i++)
                a[i] = i;
            IMem<int> dev_a = Cl.CreateBuffer<int>(context, MemFlags.ReadWrite, N, out e1);
            Kernel kernel = Cl.CreateKernel(program, "mykernel", out e1);
            e1 = Cl.SetKernelArg(kernel, 0, dev_a);
            e1 = Cl.SetKernelArg(kernel, 1, N);
            CommandQueue command_queue = Cl.CreateCommandQueue(context, devices[0], CommandQueueProperties.None, out e1);
            Event evt;
            e1 = Cl.EnqueueWriteBuffer(command_queue, dev_a, Bool.True, a, 0, null, out evt);
            IntPtr[] tiles;
            IntPtr[] tile_size;
            l2t(grid_size * block_size, 1, out tile_size, out tiles);
            e1 = Cl.EnqueueNDRangeKernel(command_queue, kernel, 1, null, (IntPtr[])tiles, (IntPtr[])tile_size, 0, null, out evt);
            e1 = Cl.EnqueueReadBuffer(command_queue, dev_a, Bool.True, a, 0, null, out evt);
        }

        static int the_blocksize = 256;

        static void l2t(int size, int max_dimensionality, out IntPtr[] tile_size, out IntPtr[] tiles)
        {
            tile_size = new IntPtr[3];
            tiles = new IntPtr[3];            
            for (int j = 0; j < max_dimensionality; ++j)
                tiles[j] = (IntPtr)1;
            int[] max_threads = new int[3]{ the_blocksize, 64, 64};
            int[] max_blocks = new int[3] { 65535, 65535, 65535 };
            for (int j = 0; j < max_dimensionality; ++j)
                tile_size[j] = (IntPtr)1;

            int b = size / (max_threads[0] * max_blocks[0]);
            if (b == 0)
            {
                b = size / max_threads[0];
                if (size % max_threads[0] != 0)
                    b++;

                if (b == 1)
                    max_threads[0] = size;

                // done. return the result.
                tiles[0] = (IntPtr)b;
                tile_size[0] = (IntPtr)max_threads[0];

                // OpenCL uses multiples of tile_size.
                tiles[0] = (IntPtr)((int)tile_size[0] * (int)tiles[0]);
                return;
            }

            int sqrt_size = (int)Math.Sqrt((float)size / max_threads[0]);
            sqrt_size++;

            int b2 = sqrt_size / max_blocks[1];
            if (b2 == 0)
            {
                b = sqrt_size;

                // done. return the result.
                tiles[0] = tiles[1] = (IntPtr)b;
                tile_size[0] = (IntPtr)max_threads[0];

                // OpenCL uses multiples of tile_size.
                tiles[0] = (IntPtr)((int)tile_size[0] * (int)tiles[0]);
                //        tiles[1] *= tile_size[1];
                return;
            }
        }
    }
}


// kernel code
__kernel void mykernel(__global int * a, int N)
{
    int i = get_global_id(0);
    if (i < N)
        a[i] = a[i] * 2;
}

Cloo

An OpenCL API. Note: Although I was able to get the example to work with Cloo, it wasn’t easy. There are no definitive examples or documentation on the project website; familiarity with the standard OpenCL API helped me get something working. I found the API confusing for a couple of reasons: sometimes one uses the “new” operator to create an object (e.g., ComputeBuffer), and other times a factory pattern (e.g., program.CreateKernel); sometimes one accesses a static (e.g., ComputePlatform.Platforms), and other times calls a method of an instance (e.g., program.GetBuildLog). Compare the examples of Cloo and OpenCL.NET to explore the differences. The API was last updated in 2012.  http://cloo.sourceforge.net/ http://sourceforge.net/projects/cloo/

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Cloo;
using Cloo.Bindings;
using System.IO;
using System.Runtime.InteropServices;

namespace ClooSample
{
    class Program
    {
        static void Main(string[] args)
        {
            List<String> source_list = new List<string>();
            try
            {
                using (StreamReader sr = new StreamReader("kernel.cl"))
                {
                    String line = sr.ReadToEnd();
                    source_list.Add(line);
                }
            }
            catch (Exception e)
            {
                Console.WriteLine("The file could not be read:");
                Console.WriteLine(e.Message);
                System.Environment.Exit(1);
            }
            String[] source = source_list.ToArray();
            ComputePlatform cp = ComputePlatform.Platforms[1]; // index of the GPU platform; may differ per system
            ComputeContextPropertyList properties = new ComputeContextPropertyList(cp);
            ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu, properties, null, IntPtr.Zero);
            IReadOnlyCollection<ComputeDevice> devices = context.Devices;
            ComputeDevice[] devs = new ComputeDevice[1];
            devs[0] = devices.First();
            Cloo.ComputeProgram program = new ComputeProgram(context, source);
            try
            {
                program.Build(devs, "", (ComputeProgramBuildNotifier)null, IntPtr.Zero);
            }
            catch
            {
                String error = program.GetBuildLog(devs[0]);
                System.Console.WriteLine(error);
                System.Environment.Exit(1);
            }
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            for (int i = 0; i < N; i++)
                a[i] = i;
            Cloo.ComputeBuffer<int> deva = new ComputeBuffer<int>(context, ComputeMemoryFlags.ReadWrite, a);
            Cloo.ComputeKernel kernel = program.CreateKernel("mykernel");
            kernel.SetMemoryArgument(0, deva);
            kernel.SetValueArgument(1, N);
            Cloo.ComputeCommandQueue command_queue = new ComputeCommandQueue(context, devs[0], ComputeCommandQueueFlags.None);
            command_queue.WriteToBuffer(a, deva, true, null);
            long[] tiles;
            long[] tile_size;
            l2t(grid_size * block_size, 1, out tile_size, out tiles);
            command_queue.Execute(kernel, null, tiles, tile_size, null);
            command_queue.ReadFromBuffer(deva, ref a, true, null);
        }

        static int the_blocksize = 256;

        static void l2t(int size, int max_dimensionality, out long[] tile_size, out long[] tiles)
        {
            tile_size = new long[max_dimensionality];
            tiles = new long[max_dimensionality];
            for (int j = 0; j < max_dimensionality; ++j)
                tiles[j] = (long)1;
            int[] max_threads = new int[3] { the_blocksize, 64, 64 };
            int[] max_blocks = new int[3] { 65535, 65535, 65535 };
            for (int j = 0; j < max_dimensionality; ++j)
                tile_size[j] = (long)1;

            int b = size / (max_threads[0] * max_blocks[0]);
            if (b == 0)
            {
                b = size / max_threads[0];
                if (size % max_threads[0] != 0)
                    b++;

                if (b == 1)
                    max_threads[0] = size;

                // done. return the result.
                tiles[0] = (long)b;
                tile_size[0] = (long)max_threads[0];

                // OpenCL uses multiples of tile_size.
                tiles[0] = (long)((int)tile_size[0] * (int)tiles[0]);
                return;
            }

            int sqrt_size = (int)Math.Sqrt((float)size / max_threads[0]);
            sqrt_size++;

            int b2 = sqrt_size / max_blocks[1];
            if (b2 == 0)
            {
                b = sqrt_size;

                // done. return the result.
                tiles[0] = tiles[1] = (long)b;
                tile_size[0] = (long)max_threads[0];

                // OpenCL uses multiples of tile_size.
                tiles[0] = (long)((int)tile_size[0] * (int)tiles[0]);
                //        tiles[1] *= tile_size[1];
                return;
            }
        }
    }
}


Note: A file containing all these examples in MS Visual Studio 2013 projects is available https://domemtech.box.com/s/a472ynr3duu1vlle2i0dnwjpzclfd0k6.

Updated May 3, 2015