I’ve a feeling we’re not in Kansas anymore

Now that Campy can compile simple value and reference types, the next issue is the runtime. Even on simple examples, when Campy decompiles a kernel in order to JIT it for the GPU, it rapidly discovers that the kernel references the NET runtime in most C# code.

In all NET runtimes, while most of the source is written in C#–which can be JIT’ed by Campy–there is a small portion of the Base Class Library  (see ECMA-334, Appendix D) that is also written in C source code, and therefore cannot be JIT’ed by Campy. Furthermore, depending on how a program that uses Campy is run, Campy decompiles different runtimes for the same kernel source code.

Let’s take a simple string search example, which searches for all occurrences of a pattern in another using a brute-force algorithm.

string text = ...;
string pattern = ...;
int[] found = ...;
Campy.Parallel.For(n, i =>
{
    int j;
    for (j = 0; j < m && i + j < n; j++)
        if (text[i + j] != pattern[j])
            break;
    // mismatch found, break the inner loop
    if (j == m)
    {
        // match found
        found[i] = 1;
    }
});

When built with NET Framework 4.7, Campy discovers the kernel method and the BCL method String.get_Chars(). However, get_Chars() has no CIL body as it is a native C function, which we also know because it is tagged “Is Internal” when using reflection.

If the program is run under Mono, Campy discovers a different implementation, all being in CIL, which can be JIT’ed.

The result is a richly textured landscape of different runtimes that need to be accommodated. That said, much of the runtime just doesn’t make any sense on the GPU: networking, graphics drawing, etc.

After reading some of the source code for Coreclr, Corert, Mono, NET Micro Framework, Dot Net Anywhere, I’ve come to the conclusion that I cannot simply replace the function calls to C functions of the runtime with unsafe C# code, which I have been doing so far. This cannot work as there must be an agreement between the implementation of the data structures in the BCL presumed between C# and C source.

Therefore, Campy must implement a BCL specific for it. Fortunately, there are plenty of examples to start from. Unfortunately, while CUDA C++ on Windows can compile C++11/14, it requires all device-runnable functions to be tagged with the __device__ qualifier, a rather annoying requirement in CUDA.

 

Advancing with value and reference types, and onto a runtime

There has been significant progress of Campy. It is now starting to compile both value types and reference types, and small bits of the NET runtime. For value types, it compiles the usual ints, floats, doubles, etc., but structs as well. For methods, it compiles static and non-virtual functions. Support for NET runtime is still next to nil because I cannot get Mono Cecil to decompile the bytecode of the NET runtime.

It hasn’t been easy getting to this stage. Despite years in compiler development, and many more years as a developer, I haven’t kept up with the latest tech. Most compiler writers now to use LLVM, which I wasn’t at all familiar with, tending to write my own code for everything. The sparsity of examples showing how to use LLVM-C didn’t help. Fortunately, I am coming up to speed, and when I get some time to breath, I plan on writing some very basic straight line code that will function as both examples and unit tests of LLVM-C. It’s taken two months to write a thunking layer to get C# to talk to LLVM-C, a month to write a thunking layer to get C# to talk to the CUDA Driver API, and now three months of 12+ hours per day/7 days per week to get the compiler to translate CIL into NVIDIA PTX, copy data structures between C# and an internal representation, and running the GPU code.

Two important examples now work: reduction over integer addition and the Fast Fourier Transform. The later uses the System.Numerics.Complex, which is part of the NET runtime. Complex is a struct (i.e., value type). This example cannot be compiled by any of the other C#/GPU compilers out there (Alea GPU, ILGPU). Note, the combination of the time for JIT compilation and the deep data structure copying to/from pinned global memory for the GPU makes for a very slow implementation of algorithms for the GPU. But, I have plans to fix this.

Speaking of which, it turns out there is quite a bit of NET runtime that must be written as a drop-in replacement for certain types one uses. For example, you can index a string to get at characters, but this calls a NET runtime method for P/Invoke functions. There are basically two runtimes from which to choose: Mono or Net Core. At this point, I am not quite sure which will be best.

I’ve been making semi-regular releases of Campy for 4.7 Net Framework programs on Windows systems with NVIDIA GPU’s. It’s still has a long way to go, but it’s very encouraging to see some rather complex examples working.

Given the progress in Campy, I’m planning on going to NVIDIA’s GPU Technology Conference in March 2018. In fact, I will be submitting a proposal for a presentation on Campy for the conference. I hope my proposal is accepted, and I have a chance to meet other C#/GPU developers at the conference.

Here is the code for reduction and the FFT for you to get an idea of what things are looking like.

Reduction

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using Campy;

namespace Reduction
{
    public class Bithacks
    {
        static bool preped;

        static int[] LogTable256 = new int[256];

        static void prep()
        {
            LogTable256[0] = LogTable256[1] = 0;
            for (int i = 2; i < 256; i++)
            {
                LogTable256[i] = 1 + LogTable256[i / 2];
            }
            LogTable256[0] = -1; // if you want log(0) to return -1

            // Prepare the reverse bits table.
            prep_reverse_bits();
        }

        public static int FloorLog2(uint v)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            int r; // r will be lg(v)
            uint tt; // temporaries

            if ((tt = v >> 24) != 0)
            {
                r = (24 + LogTable256[tt]);
            }
            else if ((tt = v >> 16) != 0)
            {
                r = (16 + LogTable256[tt]);
            }
            else if ((tt = v >> 8) != 0)
            {
                r = (8 + LogTable256[tt]);
            }
            else
            {
                r = LogTable256[v];
            }
            return r;
        }

        public static long FloorLog2(ulong v)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            long r; // r will be lg(v)
            ulong tt; // temporaries

            if ((tt = v >> 56) != 0)
            {
                r = (56 + LogTable256[tt]);
            }
            else if ((tt = v >> 48) != 0)
            {
                r = (48 + LogTable256[tt]);
            }
            else if ((tt = v >> 40) != 0)
            {
                r = (40 + LogTable256[tt]);
            }
            else if ((tt = v >> 32) != 0)
            {
                r = (32 + LogTable256[tt]);
            }
            else if ((tt = v >> 24) != 0)
            {
                r = (24 + LogTable256[tt]);
            }
            else if ((tt = v >> 16) != 0)
            {
                r = (16 + LogTable256[tt]);
            }
            else if ((tt = v >> 8) != 0)
            {
                r = (8 + LogTable256[tt]);
            }
            else
            {
                r = LogTable256[v];
            }
            return r;
        }

        public static int CeilingLog2(uint v)
        {
            int r = Bithacks.FloorLog2(v);
            if (r < 0)
                return r;
            if (v != (uint)Bithacks.Power2((uint)r))
                return r + 1;
            else
                return r;
        }

        public static int Power2(uint v)
        {
            if (v == 0)
                return 1;
            else
                return (int)(2 << (int)(v - 1));
        }

        public static int Power2(int v)
        {
            if (v == 0)
                return 1;
            else
                return (int)(2 << (int)(v - 1));
        }

        static byte[] BitReverseTable256 = new byte[256];

        static void R2(ref int i, byte v)
        {
            BitReverseTable256[i++] = v;
            BitReverseTable256[i++] = (byte)(v + 2 * 64);
            BitReverseTable256[i++] = (byte)(v + 1 * 64);
            BitReverseTable256[i++] = (byte)(v + 3 * 64);
        }

        static void R4(ref int i, byte v)
        {
            R2(ref i, v);
            R2(ref i, (byte)(v + 2 * 16));
            R2(ref i, (byte)(v + 1 * 16));
            R2(ref i, (byte)(v + 3 * 16));
        }

        static void R6(ref int i, byte v)
        {
            R4(ref i, v);
            R4(ref i, (byte)(v + 2 * 4));
            R4(ref i, (byte)(v + 1 * 4));
            R4(ref i, (byte)(v + 3 * 4));
        }

        static void prep_reverse_bits()
        {
            int i = 0;
            R6(ref i, 0);
            R6(ref i, 2);
            R6(ref i, 1);
            R6(ref i, 3);
        }

        public static byte ReverseBits(byte from)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            return BitReverseTable256[from];
        }

        public static Int32 ReverseBits(Int32 from)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            Int32 result = 0;
            for (int i = 0; i < sizeof(Int32); ++i)
            {
                result = result << 8;
                result |= BitReverseTable256[(byte)(from & 0xff)];
                from = from >> 8;
            }
            return result;
        }

        public static UInt32 ReverseBits(UInt32 from)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            UInt32 result = 0;
            for (int i = 0; i < sizeof(UInt32); ++i)
            {
                result = result << 8;
                result |= BitReverseTable256[(byte)(from & 0xff)];
                from = from >> 8;
            }
            return result;
        }

        static int Ones(uint x)
        {
            // 32-bit recursive reduction using SWAR...  but first step is mapping 2-bit values
            // into sum of 2 1-bit values in sneaky way
            x -= ((x >> 1) & 0x55555555);
            x = (((x >> 2) & 0x33333333) + (x & 0x33333333));
            x = (((x >> 4) + x) & 0x0f0f0f0f);
            x += (x >> 8);
            x += (x >> 16);
            return (int)(x & 0x0000003f);
        }

        public static int xFloorLog2(uint x)
        {
            x |= (x >> 1);
            x |= (x >> 2);
            x |= (x >> 4);
            x |= (x >> 8);
            x |= (x >> 16);
            return (Bithacks.Ones(x) - 1);
        }

        public static int Log2(uint x)
        {
            return FloorLog2(x);
        }

        public static int Log2(int x)
        {
            return FloorLog2((uint)x);
        }


    }

    [TestClass]
    public class Reduction
    {
        [TestMethod]
        public void ReductionT()
        {
            int n = Bithacks.Power2(10);
            int result_gpu = 0;
            int result_cpu = 0;
            {
                int[] data = new int[n];
                Campy.Parallel.For(n, idx => data[idx] = 1);
                for (int level = 1; level <= Bithacks.Log2(n); level++)
                {
                    int step = Bithacks.Power2(level);
                    Campy.Parallel.For(new Extent(n / step), idx =>
                    {
                        var i = step * idx;
                        data[i] = data[i] + data[i + step / 2];
                    });
                }
                result_gpu = data[0];
            }
            {
                int[] data = new int[n];
                for (int idx = 0; idx < n; ++idx) data[idx] = 1;
                for (int level = 1; level <= Bithacks.Log2(n); level++)
                {
                    int step = Bithacks.Power2(level);
                    for (int idx = 0; idx < n / step; idx++)
                    {
                        var i = step * idx;
                        data[i] = data[i] + data[i + step / 2];
                    }
                }
                result_cpu = data[0];
            }
            if (result_gpu != result_cpu) throw new Exception();
        }
    }
}

FFT

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using System.Linq;
using System.Numerics;

namespace FFT
{
    [TestClass]
    public class UnitTest1
    {
        /* Performs a Bit Reversal Algorithm on a postive integer 
         * for given number of bits
         * e.g. 011 with 3 bits is reversed to 110
         */
        public static int BitReverse(int n, int bits)
        {
            int reversedN = n;
            int count = bits - 1;

            n >>= 1;
            while (n > 0)
            {
                reversedN = (reversedN << 1) | (n & 1);
                count--;
                n >>= 1;
            }

            return ((reversedN << count) & ((1 << bits) - 1));
        }

        /* Uses Cooley-Tukey iterative in-place algorithm with radix-2 DIT case
         * assumes no of points provided are a power of 2 */
        public static void FFT(Complex[] buffer)
        {

            int bits = (int)Math.Log(buffer.Length, 2);
            for (int j = 1; j < buffer.Length / 2; j++)
            {
                int swapPos = BitReverse(j, bits);
                var temp = buffer[j];
                buffer[j] = buffer[swapPos];
                buffer[swapPos] = temp;
            }

            for (int N = 2; N <= buffer.Length; N <<= 1)
            {
                for (int i = 0; i < buffer.Length; i += N)
                {
                    for (int k = 0; k < N / 2; k++)
                    {
                        int evenIndex = i + k;
                        int oddIndex = i + k + (N / 2);
                        var even = buffer[evenIndex];
                        var odd = buffer[oddIndex];

                        double term = -2 * Math.PI * k / (double)N;
                        Complex exp = new Complex(Math.Cos(term), Math.Sin(term)) * odd;

                        buffer[evenIndex] = even + exp;
                        buffer[oddIndex] = even - exp;
                    }
                }
            }
        }

        public static void FFTGPU(Complex[] buffer)
        {
            int bits = (int)Math.Log(buffer.Length, 2);

            Campy.Parallel.For(buffer.Length / 2 - 1, k =>
            {
                int j = k + 1;
                int swapPos = BitReverse(j, bits);
                var temp = buffer[j];
                buffer[j] = buffer[swapPos];
                buffer[swapPos] = temp;
            });

            for (int N = 2; N <= buffer.Length; N <<= 1)
            {
                int step = N / 2;
                int bstep = N;
                Campy.Parallel.For(buffer.Length / 2, d =>
                {
                    var k = d % step;
                    var i = N * (d / step);
                    var t = d % step + N * (d / step);
                    int evenIndex = t;
                    int oddIndex = t + step;

                    var even = buffer[evenIndex];
                    var odd = buffer[oddIndex];

                    double term = -2 * Math.PI * k / (double)N;
                    Complex exp = new Complex(Math.Cos(term), Math.Sin(term)) * odd;

                    buffer[evenIndex] = even + exp;
                    buffer[oddIndex] = even - exp;
                });
            }
        }

        bool ApproxEqual(double a, double b)
        {
            if (b > a)
                return (b - a) < 0.01;
            else
                return (a - b) < 0.01;
        }

        [TestMethod]
        public void TestMethod1()
        {
            Complex[] input = { 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };
            var copy = input.ToArray();

            FFTGPU(input);
            FFT(copy);

            for (int i = 0; i < input.Length; ++i)
            {
                if (!ApproxEqual(copy[i].Real, input[i].Real)) throw new Exception();
                if (!ApproxEqual(copy[i].Imaginary, input[i].Imaginary)) throw new Exception();
            }
        }
    }
}