October 2017 – Campy

There has been significant progress of Campy. It is now starting to compile both value types and reference types, and small bits of the NET runtime. For value types, it compiles the usual ints, floats, doubles, etc., but structs as well. For methods, it compiles static and non-virtual functions. Support for NET runtime is still next to nil because I cannot get Mono Cecil to decompile the bytecode of the NET runtime.

It hasn’t been easy getting to this stage. Despite years in compiler development, and many more years as a developer, I haven’t kept up with the latest tech. Most compiler writers now to use LLVM, which I wasn’t at all familiar with, tending to write my own code for everything. The sparsity of examples showing how to use LLVM-C didn’t help. Fortunately, I am coming up to speed, and when I get some time to breath, I plan on writing some very basic straight line code that will function as both examples and unit tests of LLVM-C. It’s taken two months to write a thunking layer to get C# to talk to LLVM-C, a month to write a thunking layer to get C# to talk to the CUDA Driver API, and now three months of 12+ hours per day/7 days per week to get the compiler to translate CIL into NVIDIA PTX, copy data structures between C# and an internal representation, and running the GPU code.

Two important examples now work: reduction over integer addition and the Fast Fourier Transform. The later uses the System.Numerics.Complex, which is part of the NET runtime. Complex is a struct (i.e., value type). This example cannot be compiled by any of the other C#/GPU compilers out there (Alea GPU, ILGPU). Note, the combination of the time for JIT compilation and the deep data structure copying to/from pinned global memory for the GPU makes for a very slow implementation of algorithms for the GPU. But, I have plans to fix this.

Speaking of which, it turns out there is quite a bit of NET runtime that must be written as a drop-in replacement for certain types one uses. For example, you can index a string to get at characters, but this calls a NET runtime method for P/Invoke functions. There are basically two runtimes from which to choose: Mono or Net Core. At this point, I am not quite sure which will be best.

I’ve been making semi-regular releases of Campy for 4.7 Net Framework programs on Windows systems with NVIDIA GPU’s. It’s still has a long way to go, but it’s very encouraging to see some rather complex examples working.

Given the progress in Campy, I’m planning on going to NVIDIA’s GPU Technology Conference in March 2018. In fact, I will be submitting a proposal for a presentation on Campy for the conference. I hope my proposal is accepted, and I have a chance to meet other C#/GPU developers at the conference.

Here is the code for reduction and the FFT for you to get an idea of what things are looking like.

Reduction

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using Campy;

namespace Reduction
{
    public class Bithacks
    {
        static bool preped;

        static int[] LogTable256 = new int[256];

        static void prep()
        {
            LogTable256[0] = LogTable256[1] = 0;
            for (int i = 2; i < 256; i++)
            {
                LogTable256[i] = 1 + LogTable256[i / 2];
            }
            LogTable256[0] = -1; // if you want log(0) to return -1

            // Prepare the reverse bits table.
            prep_reverse_bits();
        }

        public static int FloorLog2(uint v)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            int r; // r will be lg(v)
            uint tt; // temporaries

            if ((tt = v >> 24) != 0)
            {
                r = (24 + LogTable256[tt]);
            }
            else if ((tt = v >> 16) != 0)
            {
                r = (16 + LogTable256[tt]);
            }
            else if ((tt = v >> 8) != 0)
            {
                r = (8 + LogTable256[tt]);
            }
            else
            {
                r = LogTable256[v];
            }
            return r;
        }

        public static long FloorLog2(ulong v)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            long r; // r will be lg(v)
            ulong tt; // temporaries

            if ((tt = v >> 56) != 0)
            {
                r = (56 + LogTable256[tt]);
            }
            else if ((tt = v >> 48) != 0)
            {
                r = (48 + LogTable256[tt]);
            }
            else if ((tt = v >> 40) != 0)
            {
                r = (40 + LogTable256[tt]);
            }
            else if ((tt = v >> 32) != 0)
            {
                r = (32 + LogTable256[tt]);
            }
            else if ((tt = v >> 24) != 0)
            {
                r = (24 + LogTable256[tt]);
            }
            else if ((tt = v >> 16) != 0)
            {
                r = (16 + LogTable256[tt]);
            }
            else if ((tt = v >> 8) != 0)
            {
                r = (8 + LogTable256[tt]);
            }
            else
            {
                r = LogTable256[v];
            }
            return r;
        }

        public static int CeilingLog2(uint v)
        {
            int r = Bithacks.FloorLog2(v);
            if (r < 0)
                return r;
            if (v != (uint)Bithacks.Power2((uint)r))
                return r + 1;
            else
                return r;
        }

        public static int Power2(uint v)
        {
            if (v == 0)
                return 1;
            else
                return (int)(2 << (int)(v - 1));
        }

        public static int Power2(int v)
        {
            if (v == 0)
                return 1;
            else
                return (int)(2 << (int)(v - 1));
        }

        static byte[] BitReverseTable256 = new byte[256];

        static void R2(ref int i, byte v)
        {
            BitReverseTable256[i++] = v;
            BitReverseTable256[i++] = (byte)(v + 2 * 64);
            BitReverseTable256[i++] = (byte)(v + 1 * 64);
            BitReverseTable256[i++] = (byte)(v + 3 * 64);
        }

        static void R4(ref int i, byte v)
        {
            R2(ref i, v);
            R2(ref i, (byte)(v + 2 * 16));
            R2(ref i, (byte)(v + 1 * 16));
            R2(ref i, (byte)(v + 3 * 16));
        }

        static void R6(ref int i, byte v)
        {
            R4(ref i, v);
            R4(ref i, (byte)(v + 2 * 4));
            R4(ref i, (byte)(v + 1 * 4));
            R4(ref i, (byte)(v + 3 * 4));
        }

        static void prep_reverse_bits()
        {
            int i = 0;
            R6(ref i, 0);
            R6(ref i, 2);
            R6(ref i, 1);
            R6(ref i, 3);
        }

        public static byte ReverseBits(byte from)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            return BitReverseTable256[from];
        }

        public static Int32 ReverseBits(Int32 from)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            Int32 result = 0;
            for (int i = 0; i < sizeof(Int32); ++i)
            {
                result = result << 8;
                result |= BitReverseTable256[(byte)(from & 0xff)];
                from = from >> 8;
            }
            return result;
        }

        public static UInt32 ReverseBits(UInt32 from)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            UInt32 result = 0;
            for (int i = 0; i < sizeof(UInt32); ++i)
            {
                result = result << 8;
                result |= BitReverseTable256[(byte)(from & 0xff)];
                from = from >> 8;
            }
            return result;
        }

        static int Ones(uint x)
        {
            // 32-bit recursive reduction using SWAR...  but first step is mapping 2-bit values
            // into sum of 2 1-bit values in sneaky way
            x -= ((x >> 1) & 0x55555555);
            x = (((x >> 2) & 0x33333333) + (x & 0x33333333));
            x = (((x >> 4) + x) & 0x0f0f0f0f);
            x += (x >> 8);
            x += (x >> 16);
            return (int)(x & 0x0000003f);
        }

        public static int xFloorLog2(uint x)
        {
            x |= (x >> 1);
            x |= (x >> 2);
            x |= (x >> 4);
            x |= (x >> 8);
            x |= (x >> 16);
            return (Bithacks.Ones(x) - 1);
        }

        public static int Log2(uint x)
        {
            return FloorLog2(x);
        }

        public static int Log2(int x)
        {
            return FloorLog2((uint)x);
        }


    }

    [TestClass]
    public class Reduction
    {
        [TestMethod]
        public void ReductionT()
        {
            int n = Bithacks.Power2(10);
            int result_gpu = 0;
            int result_cpu = 0;
            {
                int[] data = new int[n];
                Campy.Parallel.For(n, idx => data[idx] = 1);
                for (int level = 1; level <= Bithacks.Log2(n); level++)
                {
                    int step = Bithacks.Power2(level);
                    Campy.Parallel.For(new Extent(n / step), idx =>
                    {
                        var i = step * idx;
                        data[i] = data[i] + data[i + step / 2];
                    });
                }
                result_gpu = data[0];
            }
            {
                int[] data = new int[n];
                for (int idx = 0; idx < n; ++idx) data[idx] = 1;
                for (int level = 1; level <= Bithacks.Log2(n); level++)
                {
                    int step = Bithacks.Power2(level);
                    for (int idx = 0; idx < n / step; idx++)
                    {
                        var i = step * idx;
                        data[i] = data[i] + data[i + step / 2];
                    }
                }
                result_cpu = data[0];
            }
            if (result_gpu != result_cpu) throw new Exception();
        }
    }
}

FFT

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using System.Linq;
using System.Numerics;

namespace FFT
{
    [TestClass]
    public class UnitTest1
    {
        /* Performs a Bit Reversal Algorithm on a postive integer 
         * for given number of bits
         * e.g. 011 with 3 bits is reversed to 110
         */
        public static int BitReverse(int n, int bits)
        {
            int reversedN = n;
            int count = bits - 1;

            n >>= 1;
            while (n > 0)
            {
                reversedN = (reversedN << 1) | (n & 1);
                count--;
                n >>= 1;
            }

            return ((reversedN << count) & ((1 << bits) - 1));
        }

        /* Uses Cooley-Tukey iterative in-place algorithm with radix-2 DIT case
         * assumes no of points provided are a power of 2 */
        public static void FFT(Complex[] buffer)
        {

            int bits = (int)Math.Log(buffer.Length, 2);
            for (int j = 1; j < buffer.Length / 2; j++)
            {
                int swapPos = BitReverse(j, bits);
                var temp = buffer[j];
                buffer[j] = buffer[swapPos];
                buffer[swapPos] = temp;
            }

            for (int N = 2; N <= buffer.Length; N <<= 1)
            {
                for (int i = 0; i < buffer.Length; i += N)
                {
                    for (int k = 0; k < N / 2; k++)
                    {
                        int evenIndex = i + k;
                        int oddIndex = i + k + (N / 2);
                        var even = buffer[evenIndex];
                        var odd = buffer[oddIndex];

                        double term = -2 * Math.PI * k / (double)N;
                        Complex exp = new Complex(Math.Cos(term), Math.Sin(term)) * odd;

                        buffer[evenIndex] = even + exp;
                        buffer[oddIndex] = even - exp;
                    }
                }
            }
        }

        public static void FFTGPU(Complex[] buffer)
        {
            int bits = (int)Math.Log(buffer.Length, 2);

            Campy.Parallel.For(buffer.Length / 2 - 1, k =>
            {
                int j = k + 1;
                int swapPos = BitReverse(j, bits);
                var temp = buffer[j];
                buffer[j] = buffer[swapPos];
                buffer[swapPos] = temp;
            });

            for (int N = 2; N <= buffer.Length; N <<= 1)
            {
                int step = N / 2;
                int bstep = N;
                Campy.Parallel.For(buffer.Length / 2, d =>
                {
                    var k = d % step;
                    var i = N * (d / step);
                    var t = d % step + N * (d / step);
                    int evenIndex = t;
                    int oddIndex = t + step;

                    var even = buffer[evenIndex];
                    var odd = buffer[oddIndex];

                    double term = -2 * Math.PI * k / (double)N;
                    Complex exp = new Complex(Math.Cos(term), Math.Sin(term)) * odd;

                    buffer[evenIndex] = even + exp;
                    buffer[oddIndex] = even - exp;
                });
            }
        }

        bool ApproxEqual(double a, double b)
        {
            if (b > a)
                return (b - a) < 0.01;
            else
                return (a - b) < 0.01;
        }

        [TestMethod]
        public void TestMethod1()
        {
            Complex[] input = { 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };
            var copy = input.ToArray();

            FFTGPU(input);
            FFT(copy);

            for (int i = 0; i < input.Length; ++i)
            {
                if (!ApproxEqual(copy[i].Real, input[i].Real)) throw new Exception();
                if (!ApproxEqual(copy[i].Imaginary, input[i].Imaginary)) throw new Exception();
            }
        }
    }
}

Month: October 2017

I’ve a feeling we’re not in Kansas anymore

Advancing with value and reference types, and onto a runtime

Reduction

FFT