Advancing with value and reference types

There has been significant progress on Campy. It can now compile both value types and reference types. For value types, it compiles the usual ints, floats, doubles, and so on, as well as structs. For methods, it compiles static and non-virtual functions. Runtime support for .NET is still, however, next to nil.

It hasn’t been easy getting to this stage. Despite years in compiler development, and many more years as a developer, I haven’t kept up with the latest tech. Most compiler writers now use LLVM, which I wasn’t at all familiar with, since I tend to write my own code for everything. The sparsity of examples showing how to use LLVM-C didn’t help. Fortunately, I am coming up to speed, and when I get some time to breathe, I plan on writing some very basic straight-line code that will function as both examples and unit tests of LLVM-C. It’s taken two months to write a thunking layer to get C# to talk to LLVM-C, a month to write a thunking layer to get C# to talk to the CUDA Driver API, and now three months of 12+ hours per day, 7 days per week, to get the compiler to translate CIL into NVIDIA PTX, copy data structures between C# and an internal representation, and run the GPU code.

Two important examples now work: reduction over integer addition and the Fast Fourier Transform. The latter uses System.Numerics.Complex, which is part of the .NET runtime. Complex is a struct (i.e., a value type). This example cannot be compiled by any of the other C#/GPU compilers out there (Alea GPU, ILGPU). Note that the combination of the time for JIT compilation and the deep data-structure copying to/from pinned global memory makes for a very slow GPU implementation of these algorithms. But I have plans to fix this with a special runtime developed from either Mono or .NET Core.

I’ve been making semi-regular releases of Campy for .NET Framework 4.7 programs on Windows systems with NVIDIA GPUs. It still has a long way to go, but it’s very encouraging to see some rather complex examples working.

Given the progress in Campy, I’m planning on going to NVIDIA’s GPU Technology Conference in March 2018. In fact, I will be submitting a proposal for a presentation on Campy for the conference. I hope my proposal is accepted, and I have a chance to meet other C#/GPU developers at the conference.

Here is the code for reduction and the FFT, to give you an idea of what things look like.

Reduction

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using Campy;

namespace Reduction
{
    public class Bithacks
    {
        static bool preped;

        static int[] LogTable256 = new int[256];

        static void prep()
        {
            LogTable256[0] = LogTable256[1] = 0;
            for (int i = 2; i < 256; i++)
            {
                LogTable256[i] = 1 + LogTable256[i / 2];
            }
            LogTable256[0] = -1; // if you want log(0) to return -1

            // Prepare the reverse bits table.
            prep_reverse_bits();
        }

        public static int FloorLog2(uint v)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            int r; // r will be lg(v)
            uint tt; // temporaries

            if ((tt = v >> 24) != 0)
            {
                r = (24 + LogTable256[tt]);
            }
            else if ((tt = v >> 16) != 0)
            {
                r = (16 + LogTable256[tt]);
            }
            else if ((tt = v >> 8) != 0)
            {
                r = (8 + LogTable256[tt]);
            }
            else
            {
                r = LogTable256[v];
            }
            return r;
        }

        public static long FloorLog2(ulong v)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            long r; // r will be lg(v)
            ulong tt; // temporaries

            if ((tt = v >> 56) != 0)
            {
                r = (56 + LogTable256[tt]);
            }
            else if ((tt = v >> 48) != 0)
            {
                r = (48 + LogTable256[tt]);
            }
            else if ((tt = v >> 40) != 0)
            {
                r = (40 + LogTable256[tt]);
            }
            else if ((tt = v >> 32) != 0)
            {
                r = (32 + LogTable256[tt]);
            }
            else if ((tt = v >> 24) != 0)
            {
                r = (24 + LogTable256[tt]);
            }
            else if ((tt = v >> 16) != 0)
            {
                r = (16 + LogTable256[tt]);
            }
            else if ((tt = v >> 8) != 0)
            {
                r = (8 + LogTable256[tt]);
            }
            else
            {
                r = LogTable256[v];
            }
            return r;
        }

        public static int CeilingLog2(uint v)
        {
            int r = Bithacks.FloorLog2(v);
            if (r < 0)
                return r;
            if (v != (uint)Bithacks.Power2((uint)r))
                return r + 1;
            else
                return r;
        }

        public static int Power2(uint v)
        {
            if (v == 0)
                return 1;
            else
                return (int)(2 << (int)(v - 1));
        }

        public static int Power2(int v)
        {
            if (v == 0)
                return 1;
            else
                return (int)(2 << (int)(v - 1));
        }

        static byte[] BitReverseTable256 = new byte[256];

        static void R2(ref int i, byte v)
        {
            BitReverseTable256[i++] = v;
            BitReverseTable256[i++] = (byte)(v + 2 * 64);
            BitReverseTable256[i++] = (byte)(v + 1 * 64);
            BitReverseTable256[i++] = (byte)(v + 3 * 64);
        }

        static void R4(ref int i, byte v)
        {
            R2(ref i, v);
            R2(ref i, (byte)(v + 2 * 16));
            R2(ref i, (byte)(v + 1 * 16));
            R2(ref i, (byte)(v + 3 * 16));
        }

        static void R6(ref int i, byte v)
        {
            R4(ref i, v);
            R4(ref i, (byte)(v + 2 * 4));
            R4(ref i, (byte)(v + 1 * 4));
            R4(ref i, (byte)(v + 3 * 4));
        }

        static void prep_reverse_bits()
        {
            int i = 0;
            R6(ref i, 0);
            R6(ref i, 2);
            R6(ref i, 1);
            R6(ref i, 3);
        }

        public static byte ReverseBits(byte from)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            return BitReverseTable256[from];
        }

        public static Int32 ReverseBits(Int32 from)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            Int32 result = 0;
            for (int i = 0; i < sizeof(Int32); ++i)
            {
                result = result << 8;
                result |= BitReverseTable256[(byte)(from & 0xff)];
                from = from >> 8;
            }
            return result;
        }

        public static UInt32 ReverseBits(UInt32 from)
        {
            if (!preped)
            {
                prep();
                preped = true;
            }
            UInt32 result = 0;
            for (int i = 0; i < sizeof(UInt32); ++i)
            {
                result = result << 8;
                result |= BitReverseTable256[(byte)(from & 0xff)];
                from = from >> 8;
            }
            return result;
        }

        static int Ones(uint x)
        {
            // 32-bit recursive reduction using SWAR...  but first step is mapping 2-bit values
            // into sum of 2 1-bit values in sneaky way
            x -= ((x >> 1) & 0x55555555);
            x = (((x >> 2) & 0x33333333) + (x & 0x33333333));
            x = (((x >> 4) + x) & 0x0f0f0f0f);
            x += (x >> 8);
            x += (x >> 16);
            return (int)(x & 0x0000003f);
        }

        public static int xFloorLog2(uint x)
        {
            x |= (x >> 1);
            x |= (x >> 2);
            x |= (x >> 4);
            x |= (x >> 8);
            x |= (x >> 16);
            return (Bithacks.Ones(x) - 1);
        }

        public static int Log2(uint x)
        {
            return FloorLog2(x);
        }

        public static int Log2(int x)
        {
            return FloorLog2((uint)x);
        }


    }

    [TestClass]
    public class Reduction
    {
        [TestMethod]
        public void ReductionT()
        {
            int n = Bithacks.Power2(10);
            int result_gpu = 0;
            int result_cpu = 0;
            {
                int[] data = new int[n];
                Campy.Parallel.For(n, idx => data[idx] = 1);
                for (int level = 1; level <= Bithacks.Log2(n); level++)
                {
                    int step = Bithacks.Power2(level);
                    Campy.Parallel.For(new Extent(n / step), idx =>
                    {
                        var i = step * idx;
                        data[i] = data[i] + data[i + step / 2];
                    });
                }
                result_gpu = data[0];
            }
            {
                int[] data = new int[n];
                for (int idx = 0; idx < n; ++idx) data[idx] = 1;
                for (int level = 1; level <= Bithacks.Log2(n); level++)
                {
                    int step = Bithacks.Power2(level);
                    for (int idx = 0; idx < n / step; idx++)
                    {
                        var i = step * idx;
                        data[i] = data[i] + data[i + step / 2];
                    }
                }
                result_cpu = data[0];
            }
            if (result_gpu != result_cpu) throw new Exception();
        }
    }
}

FFT

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using System.Linq;
using System.Numerics;

namespace FFT
{
    [TestClass]
    public class UnitTest1
    {
        /* Performs a bit reversal on a positive integer
         * for a given number of bits,
         * e.g., 011 with 3 bits is reversed to 110.
         */
        public static int BitReverse(int n, int bits)
        {
            int reversedN = n;
            int count = bits - 1;

            n >>= 1;
            while (n > 0)
            {
                reversedN = (reversedN << 1) | (n & 1);
                count--;
                n >>= 1;
            }

            return ((reversedN << count) & ((1 << bits) - 1));
        }

        /* Uses the Cooley-Tukey iterative in-place algorithm with the radix-2 DIT case;
         * assumes the number of points provided is a power of 2. */
        public static void FFT(Complex[] buffer)
        {

            int bits = (int)Math.Log(buffer.Length, 2);
            for (int j = 1; j < buffer.Length / 2; j++)
            {
                int swapPos = BitReverse(j, bits);
                var temp = buffer[j];
                buffer[j] = buffer[swapPos];
                buffer[swapPos] = temp;
            }

            for (int N = 2; N <= buffer.Length; N <<= 1)
            {
                for (int i = 0; i < buffer.Length; i += N)
                {
                    for (int k = 0; k < N / 2; k++)
                    {
                        int evenIndex = i + k;
                        int oddIndex = i + k + (N / 2);
                        var even = buffer[evenIndex];
                        var odd = buffer[oddIndex];

                        double term = -2 * Math.PI * k / (double)N;
                        Complex exp = new Complex(Math.Cos(term), Math.Sin(term)) * odd;

                        buffer[evenIndex] = even + exp;
                        buffer[oddIndex] = even - exp;
                    }
                }
            }
        }

        public static void FFTGPU(Complex[] buffer)
        {
            int bits = (int)Math.Log(buffer.Length, 2);

            Campy.Parallel.For(buffer.Length / 2 - 1, k =>
            {
                int j = k + 1;
                int swapPos = BitReverse(j, bits);
                var temp = buffer[j];
                buffer[j] = buffer[swapPos];
                buffer[swapPos] = temp;
            });

            for (int N = 2; N <= buffer.Length; N <<= 1)
            {
                int step = N / 2;
                int bstep = N;
                Campy.Parallel.For(buffer.Length / 2, d =>
                {
                    var k = d % step;
                    var i = N * (d / step);
                    var t = d % step + N * (d / step);
                    int evenIndex = t;
                    int oddIndex = t + step;

                    var even = buffer[evenIndex];
                    var odd = buffer[oddIndex];

                    double term = -2 * Math.PI * k / (double)N;
                    Complex exp = new Complex(Math.Cos(term), Math.Sin(term)) * odd;

                    buffer[evenIndex] = even + exp;
                    buffer[oddIndex] = even - exp;
                });
            }
        }

        bool ApproxEqual(double a, double b)
        {
            if (b > a)
                return (b - a) < 0.01;
            else
                return (a - b) < 0.01;
        }

        [TestMethod]
        public void TestMethod1()
        {
            Complex[] input = { 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 };
            var copy = input.ToArray();

            FFTGPU(input);
            FFT(copy);

            for (int i = 0; i < input.Length; ++i)
            {
                if (!ApproxEqual(copy[i].Real, input[i].Real)) throw new Exception();
                if (!ApproxEqual(copy[i].Imaginary, input[i].Imaginary)) throw new Exception();
            }
        }
    }
}

 

Moving beyond the C++ AMP model

Up to this point, I had envisioned Campy as an API similar to C++ AMP. However, I'm now thinking beyond that API, looking for a simpler interface between CPU and GPU for C# and .NET code than the C++ AMP model.

In particular, C++ AMP defines array<> and array_view<>, data types that abstract an array for the GPU. What I would really prefer is to be able to run C# code on a GPU with no markup, tags, or wrappers, accessing all relevant C# data via the closure of all objects and code the kernel uses, including classes, structs, arrays, and basic value types. In other words, almost zero boilerplate to perform parallel computations on the GPU. However, this is easier said than done.

Take, for example, reduction (1). Here is a simple, clean, in-place implementation of reduction over an array of integers with the binary operator '+' in a new, experimental version of Campy.

int n = Bithacks.Power2(20);
int[] data = new int[n];

Campy.Parallel.For(new Extent(n), idx => data[idx] = 1);
for (int level = 1; level <= Bithacks.Log2(n); level++)
{
    int step = Bithacks.Power2(level);
    Campy.Parallel.For(new Extent(n / step), idx =>
    {
        var i = step * idx;
        data[i] = data[i] + data[i + step / 2];
    });
}

for (int i = 0; i < data.Length; ++i)
    System.Console.WriteLine(data[i]);

In this code, the first parallel for-loop initializes data; the second performs a sum using data and step. The variables data and step are shared between the CPU and GPU. At runtime, a delegate is created containing the function of the lambda expression passed to the Parallel.For() method. The delegate contains a target object holding each variable used in the lambda, i.e., the closure (2). Implicit in this code is a SIMT computing model with shared memory, so data structures can be shared. The final result of the sum is contained in data[0].
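To make the closure concrete, the following sketch (standard C# and reflection only; the names step, data, and kernel are made up for illustration) inspects the Target of a delegate. The C# compiler hoists captured locals into fields of a hidden "display class", and that object is exactly what a JIT like Campy's must locate and copy to the GPU.

```csharp
using System;

class ClosureDemo
{
    static void Main()
    {
        int step = 4;
        int[] data = new int[8];

        // The lambda captures 'step' and 'data'; the compiler hoists both
        // into fields of a generated closure object, reachable via Target.
        Action<int> kernel = idx => data[idx] = step;

        object target = kernel.Target;
        foreach (var f in target.GetType().GetFields())
            Console.WriteLine($"{f.Name} : {f.FieldType.Name}");
        // With current C# compilers this lists fields named 'step' and 'data'
        // on a generated type such as '<>c__DisplayClass0_0'.
    }
}
```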

This example illustrates several problems. The layout of C# data structures can vary across implementations of the .NET runtime, so the GPU code must either be aware of the particular implementation or marshal objects into a representation appropriate for the GPU. Many operations on C# data structures rely on the .NET runtime in use, so the GPU must be able to JIT the .NET runtime itself. Sharing data also raises alignment issues (for example, a 64-bit fetch must be aligned on a 64-bit boundary; 3, 4, 5). Finally, C# objects are allocated by the managed memory manager, over which we have no control (6): it allocates objects from a heap and may garbage-collect stale objects at any time.
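The alignment point can be seen with plain .NET marshaling. In this sketch (the Mixed struct is a made-up example), a byte followed by a double is padded so the double lands on an 8-byte boundary; a GPU-side representation must reproduce these offsets exactly or every field access after the first will be wrong.

```csharp
using System;
using System.Runtime.InteropServices;

// Sequential layout with default packing: the runtime inserts 7 bytes of
// padding after Tag so Value is aligned on an 8-byte boundary.
[StructLayout(LayoutKind.Sequential)]
struct Mixed
{
    public byte Tag;     // offset 0
    public double Value; // offset 8, not 1
}

class AlignDemo
{
    static void Main()
    {
        Console.WriteLine(Marshal.OffsetOf<Mixed>(nameof(Mixed.Value))); // 8
        Console.WriteLine(Marshal.SizeOf<Mixed>());                      // 16
    }
}
```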

As a first step toward sharing data structures, let's assume we are accessing only value types: user-defined arrays and structs that contain only other value types, such as integers, characters, booleans, arrays, and structs. This limits the difficulties associated with JITing the .NET runtime. Further, since we do not have access to the memory allocator/garbage collector, let us also assume the CIL does not contain calls to the memory system. ILGPU and Alea GPU make the same restrictions.

With these limitations, a simple solution I am working toward is to convert C# data structures to equivalent blittable types (7, 8), then deep-copy them to and from GPU memory. Unfortunately, this means a fair amount of copying to and from unmanaged memory, currently allocated via cuMemAllocManaged. Remember, reference types, which are what most programmers use, are not handled. A simple example with Campy restricted to value types is available (9).
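As a rough sketch of the copying involved, the snippet below round-trips blittable data through unmanaged memory. Here Marshal.AllocHGlobal is just a stand-in for a CUDA allocation such as cuMemAllocManaged; the point is that blittable values can be moved by a straight byte copy, with no per-field marshaling.

```csharp
using System;
using System.Runtime.InteropServices;

class BlitDemo
{
    static void Main()
    {
        double[] host = { 1.0, 2.0, 3.0, 4.0 };
        int bytes = sizeof(double) * host.Length;

        // Stand-in for cuMemAllocManaged: ordinary unmanaged memory.
        IntPtr dev = Marshal.AllocHGlobal(bytes);
        try
        {
            Marshal.Copy(host, 0, dev, host.Length);   // CPU -> "GPU"

            double[] back = new double[host.Length];
            Marshal.Copy(dev, back, 0, back.Length);   // "GPU" -> CPU
            Console.WriteLine(back[2]); // 3
        }
        finally
        {
            Marshal.FreeHGlobal(dev);
        }
    }
}
```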

Alea GPU Version 3 provides an interface very similar to Campy's, where kernel delegates are passed to a Parallel.For()-like routine for JITing and running. In Version 3, the "[GpuManaged]" attribute is used to alleviate the issues in synchronizing data between CPU and GPU. ILGPU Version 0.1.4 does not handle reference types (i.e., classes).

–Ken Domino

Notes and References

(1) Reduction

  • REDUCE(⊕, A): The REDUCE operation takes a binary operator ⊕ with identity I, and an ordered set A = [a0, a1, …, an−1] of n elements. The value of REDUCE(⊕, A) is (((a0 ⊕ a1) ⊕ … ) ⊕ an−1).
  • Since addition is commutative and associative for integers, REDUCE can be computed in parallel. One implementation is an in-place method, which modifies the input set. It consists of a for-loop nested within an outer for-loop, as shown in the Campy code described above. The inner for-loop can be performed in parallel on pairs.

(2) Lambda Expressions (C# Programming Guide). Microsoft Documentation. https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/statements-expressions-operators/lambda-expressions

(3) Coding for Performance: Data alignment and structures. https://software.intel.com/en-us/articles/coding-for-performance-data-alignment-and-structures

(4) Data Alignment when Migrating to 64-Bit Intel® Architecture. https://software.intel.com/en-us/articles/data-alignment-when-migrating-to-64-bit-intel-architecture

(5) How Memory Is Accessed. https://software.intel.com/en-us/articles/how-memory-is-accessed

(6) https://github.com/dotnet/coreclr/issues/1235

(7) Puran Mehram, P. Managed code and unmanaged code in .NET. C# Corner. http://www.c-sharpcorner.com/uploadfile/puranindia/managed-code-and-unmanaged-code-in-net/

(8) Parsons, J. Changes to Blittable Types. GitHubGist. https://gist.github.com/jaredpar/cecc2f5fd76b70e480450296d4c9914e . Accessed Aug 29, 2017.

(9) You can look at and try the working test on GitHub. Note, it does not yet run the Reduction example due to alignment problems.

# git clone https://github.com/kaby76/Campy.git

# cd Campy; git checkout -b proof-of-concept

 

Two steps forward, one step back…

Campy is moving forward, albeit slowly. I have written new classes that model the C++ AMP API more fully, and have run code for a simple “Hello World” example on an NVIDIA GPU.

As it turns out, the package I was using, ManagedCUDA, won’t support .NET Standard until that comes out, which might be Fall 2017, but maybe never: I have checked .NET Core 2.0/.NET Standard pre-release 1, and the .NET Standard 2.0 package does not work on a simple example. So, I will be taking extra time to write a SWIG-generated library for the CUDA Driver API. I will likely need a similar AMD GPU driver library as well, but will do that after a release of Campy that works with NVIDIA GPUs.

In the meantime, I learned that there is another library similar to Campy, namely ILGPU. It looks similar, but IMHO it is missing pieces of the C++ AMP programming model. Still, you might want to take a look at it.

–Ken Domino

Inching towards an LLVM backend

Several months ago, I started working on Campy again. Since then, I’ve had to rewrite quite a bit of the source in order to use LLVM for code generation. Campy can now compile and execute a good subset of CIL for an x86/x64 target, which is a small but important step toward the goal of executing C# on a GPU. A significant part of the work was simply exposing a C# wrapper library for LLVM. That wrapper targets Windows, Ubuntu, and Android, which as far as I can tell surpasses anything offered on NuGet. The library is Swigged.llvm, which you can find here.

The next few steps are to refine the compiler further and to target an NVIDIA GPU. In addition, I will need to write a layer exposing LLVM’s SSA analysis in order to determine which functions/methods must be compiled for the GPU. There is much work to be done, but it is worth the effort.

CampyNET reboot…

Hi Folks,

After a long break, I’ve restarted development of CampyNET! I have been porting the code over to Visual Studio 2015, and I will also port it to Visual Studio 2017 when VS 2017 is released next month!

First of all, all the code now lives on GitHub: http://github.com/kaby76/campynet. I decided to move it to GitHub because that is the most popular Git hosting site.

Second, there are a number of problems with the code under VS 2015. I’ve fixed some, but more must be corrected before I make a new release of the API. While making the port, I realized that much of the SSA code in CampyNET was broken; I’ve fixed much of it. When you run CampyNET, some code scans your environment for the C++ compiler and runtime libraries; that code is extremely fragile. CampyNET also requires Uncrustify (http://uncrustify.sourceforge.net/) to format the output C++ AMP kernel code; it really shouldn’t be a requirement.

Third, the distribution of CampyNET must be fixed. Until now, CampyNET was distributed as a downloadable EXE file to run on your system. This is ridiculous! Instead, the package should be distributed via NuGet.

These problems need to be fixed before I make another release of the API sometime in the next month.

If you have questions, please send them directly to ken.domino <AT-SIGN> gmail.com.

Ken Domino

CLI-based APIs for GPGPU Computing

How does Campy compare to other CLI-based (e.g., C#, F#) APIs for GPGPU computing? The following is a partial list of what is currently available, along with an example for each that computes the even natural numbers.

Cudafy

Cudafy is an API targeting CUDA and OpenCL. Programs and kernels are written in C# or other CIL languages. Cudafy scans the CIL, converts GPU code (labeled with the “[Cudafy]” attribute) into CUDA C++ or OpenCL, compiles the code into PTX, then runs it using CUDA.NET. Cudafy also uses Cloo (see below for more information on Cloo).

Note: This API is very good and maintained fairly often. It is easy to use and well documented.

Author: Nicholas Kopp; Download: http://cudafy.codeplex.com/ or http://www.hybriddsp.com/Products/CUDAfyNET.aspx; Last updated Jan 2015; Started ~Jun 2011; Open source, GNU Library General Public License (LGPL).

Example:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

namespace CudafyExample
{
    class Program
    {
        static void Main(string[] args)
        {
            CudafyModes.Target = eGPUType.Cuda;
            CudafyModes.DeviceId = 0;
            CudafyTranslator.Language = CudafyModes.Target == eGPUType.OpenCL ? eLanguage.OpenCL : eLanguage.Cuda;
            GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, 0);
            eArchitecture arch = gpu.GetArchitecture();
            CudafyModule km = CudafyTranslator.Cudafy(arch);
            gpu.LoadModule(km);
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            int[] dev_a = gpu.Allocate<int>(a);
            for (int i = 0; i < N; i++)
                a[i] = i;
            gpu.CopyToDevice(a, dev_a);
            gpu.Launch(grid_size, block_size, fun, dev_a, N);
            gpu.CopyFromDevice(dev_a, a);
            gpu.Free(dev_a);
        }

        [Cudafy]
        public static void fun(GThread thread, int[] a, int N)
        {
            int tid = thread.blockIdx.x;
            if (tid < N)
            {
                a[tid] = a[tid] * 2;
            }
        }
    }
}

ManagedCUDA

ManagedCUDA is an API for the CUDA Driver API. The user writes kernels in CUDA C++, then compiles them into PTX via the CUDA compiler. ManagedCUDA executes the PTX kernels. Copying data to/from the GPU is achieved via an assignment, which is a nice simplification in syntax.

Note: This API is also well maintained and easy to use. The main issue is that kernels must be written in CUDA C++, then compiled into PTX. If you’re looking for a way to write kernels in C#, this solution isn’t for you. But it is an excellent, straightforward API to the CUDA Driver API.

Author: kunzmi (Michael Kunz); Download: http://managedcuda.codeplex.com/; Last updated Feb 2015; Started ~ Jan 2011; Open source, GNU Library General Public License (LGPL).

Example:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using ManagedCuda;
using ManagedCuda.BasicTypes;
using ManagedCuda.VectorTypes;
using System.IO;
using System.Reflection;

namespace ManagedCudaExample
{
    class Program
    {
        static void Main(string[] args)
        {
            CudaContext ctx = new CudaContext(CudaContext.GetMaxGflopsDeviceId());
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            CudaDeviceVariable<int> dev_a = new CudaDeviceVariable<int>(N);
            for (int i = 0; i < N; i++)
                a[i] = i;
            dev_a = a;
            string resName = "kernel.ptx";
            string resNamespace = "ManagedCudaExample";
            string resource = resNamespace + "." + resName;
            Stream stream = Assembly.GetExecutingAssembly().GetManifestResourceStream(resource);
            if (stream == null) throw new ArgumentException("Kernel not found in resources.");
            CudaKernel kernel = ctx.LoadKernelPTX(stream, "kernel");
            int threadsPerBlock = block_size;
            kernel.BlockDimensions = block_size;
            kernel.GridDimensions = grid_size;
            kernel.Run(dev_a.DevicePointer, N);
            a = dev_a;
        }
    }
}

ManagedCUDA requires kernels built as PTX. The following is the kernel in CUDA C++, which can be compiled into PTX via the CUDA compiler with the --keep option.

// Includes for IntelliSense
#include <cuda.h>

extern "C" {
    // kernel code
    __global__ void kernel(int * a, int N)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < N)
            a[i] = a[i] * 2;
    }
}

CUDA.NET

CUDA.NET is an API for the CUDA Driver API. Users write kernels in CUDA C++, compile them into PTX, then execute the PTX with CUDA.NET.

Download: http://www.cass-hpc.com/solutions/libraries/cuda-net; Last updated 2010; Started ~Apr 2008.

Notes: Supposedly, the code for CUDA.NET is open source and has no license requirements. However, the archive page does not have a download link for the source, just the binaries. The only source I could find is in the Cudafy source tree (Cudafy/3p/cuda.net3.0.0_win/src/…), where it has been updated for Cudafy and CUDA 6.5. The latest standalone build is for CUDA 3.0, which is very old (currently, NVIDIA is on CUDA 7.0). The documentation in the latest download, v3.0, is unreadable. The only other documentation I could find is for v2.0, http://www.hoopoe-cloud.com/files/cuda.net/2.0/CUDA.NET_2.0.pdf, but it contains numerous inaccuracies.

Example:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using GASS.CUDA;
using GASS.CUDA.Types;
using GASS.Types;

namespace CudaNetExample
{
    class Program
    {
        static void Main(string[] args)
        {
            CUResult i0 = CUDADriver.cuInit(0);
            CUdevice dev = new CUdevice();
            CUResult i1 = CUDADriver.cuDeviceGet(ref dev, 0);
            CUcontext ctx = new CUcontext();
            CUResult i2 = CUDADriver.cuCtxCreate(ref ctx, 0, dev);

            CUdeviceptr p1 = new CUdeviceptr();
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            for (int i = 0; i < N; i++)
                a[i] = i;
            CUResult s1 = CUDADriver.cuMemAlloc(ref p1, N * sizeof(int));
            CUResult s2 = CUDADriver.cuMemcpyHtoD(p1, a, a.Length * sizeof(int));
            CUmodule mod = new CUmodule();
            CUResult s3 = CUDADriver.cuModuleLoad(ref mod, @"C:\Users\Ken\Documents\Visual Studio 2013\Projects\ManagedCudaExample\Win32Project1\Debug\kernel.ptx");
            CUfunction func = new CUfunction();
            CUResult s4 = CUDADriver.cuModuleGetFunction(ref func, mod, "kernel");

            CUResult r1 = CUDADriver.cuParamSeti(func, 0, (uint)p1.Pointer);
            CUResult r2 = CUDADriver.cuParamSeti(func, 4, (uint)N);
            CUResult r3 = CUDADriver.cuParamSetSize(func, 8);
            CUResult r4 = CUDADriver.cuFuncSetBlockShape(func, block_size, 1, 1);
            CUResult r5 = CUDADriver.cuLaunchGrid(func, grid_size, 1);
            CUResult r6 = CUDADriver.cuCtxSynchronize();

            CUResult f1 = CUDADriver.cuMemcpyDtoH(a, p1, a.Length * sizeof(int));
 
        }
    }
}

Alea GPU

Alea GPU is an API similar to Cudafy. Kernels are annotated with the “[AOTCompile]” attribute and are then compiled for the GPU via LLVM.

Author: Daniel Egloff; Download: http://quantalea.com/home and http://quantalea.com/static/app/tutorial/quick_start/quick_start_example.html; Last updated Mar 2015. Proprietary.

Example:

using System;
using Alea.CUDA;
using Alea.CUDA.Utilities;
using Alea.CUDA.IL;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace AleaExample
{
    class Program
    {
        [AOTCompile]
        static void kernel(deviceptr<int> a, int n)
        {
            var i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)
                a[i] = a[i] * 2;
        }

        static void Main(string[] args)
        {
            var inputs = Enumerable.Range(0, 1000000).Select(i => i).ToArray();
            var worker = Worker.Default;
            DeviceMemory<int> dInputs = worker.Malloc(inputs);
            const int blockSize = 256;
            var numSm = worker.Device.Attributes.MULTIPROCESSOR_COUNT;
            var gridSize = Math.Min(16 * numSm, Common.divup(inputs.Length, blockSize));
            var lp = new LaunchParam(gridSize, blockSize);
            worker.Launch<deviceptr<int>, int>(kernel, lp, dInputs.Ptr, inputs.Length);
        }
    }
}

Note: I could not get the example to compile and run. The Alea GPU compiler requires a license, which I could not get because the email confirmation link produced an error on the company’s server.

CudaSharp

In CudaSharp, users write kernels in C#. The GPU code is converted into PTX via LLVM, then executed via the ManagedCUDA API. Kernels can be lambdas.

Author: khyperia (Evan Huack); Download: https://github.com/khyperia/CudaSharp; Started Jan 2014; Last updated Jan 2014; Open source, no license.

Note: This project is incomplete. The source code for the API does not build. Further, while in principle the code translates MS CIL, a stack-based assembly language, into LLVM, an SSA-based representation, the translation may not be correct because there is no code that merges the evaluation stacks of multiple predecessor blocks.
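For context, consider why that stack merging is needed. A C# conditional expression compiles to CIL in which each branch pushes its result onto the evaluation stack before control joins, so an SSA-based translation must merge the two pushed values (e.g., with a phi node). A small illustration (the method name is invented for this sketch):

```csharp
using System;

class StackJoinExample
{
    // In CIL, "cond ? a : b" compiles to two branches that each push a
    // value onto the evaluation stack before control joins at the store
    // to x. A CIL-to-SSA translator must merge the two pushed values
    // (e.g., with a phi node); skipping that step produces wrong code.
    public static int PickDoubled(bool cond, int a, int b)
    {
        int x = cond ? a : b;
        return x * 2;
    }

    static void Main()
    {
        Console.WriteLine(PickDoubled(true, 3, 4));  // 6
        Console.WriteLine(PickDoubled(false, 3, 4)); // 8
    }
}
```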

A similar project was written for a Masters Thesis by Soren Horup, Soren Juul, and Herik Larsen at Aalborg University in June 2011. The thesis is available at http://projekter.aau.dk/projekter/files/52819053/Report.pdf. However, I could not find the source code for the project.

GPU.NET

The UK-based company TidePowerd.com, started by Nicolas Beecroft and Jack Pappas, produced a CIL-based backend targeting CUDA GPUs. The website is no longer available, so the details aren’t clear. But there are examples in https://github.com/tidepowerd/GPU.NET-Example-Projects, which indicate kernel markup with “[kernel]” attributes and a high-level model of the GPU based upon CUDA C++.

GpuLinq

Written in F#, this LINQ-like API contains a compiler that runs queries on the GPU (https://github.com/nessos/GpuLinq/). The current status of the project is unknown.

Brahma

A project of the Software Engineering Research Group at Saint Petersburg State University, Brahma was reportedly an API that targeted OpenCL. However, it is no longer available. There is a similar project, Brahma.FSharp, which seems to be active (last updated 2014), but the details are unclear. https://sites.google.com/site/semathsrprojects/home

Microsoft Accelerator

This project offered a LINQ-like API with a compiler that runs queries on the GPU. The status of the project is unclear. http://research.microsoft.com/en-us/projects/Accelerator/

OpenCL.NET

OpenCL is an API for multicore computing, whether it be on a CPU or GPU. The API is platform and device neutral. OpenCL.NET is a thin .NET wrapper on top of the native OpenCL.dll library. OpenCL requires one to write the kernels in C; these are compiled, linked, loaded, and run by the OpenCL drivers. Note: OpenCL.NET has not been updated since Sep 2013. https://openclnet.codeplex.com/

Example:

using System;
using System.Diagnostics;
using System.Linq;
using OpenCL.Net.Extensions;
using OpenCL.Net;
using System.IO;
using System.Collections.Generic;

namespace Opencl_Device_Query
{
    class Program
    {
        static void Main(string[] args)
        {
            List<String> source_list = new List<string>();            
            try
            {
                using (StreamReader sr = new StreamReader("kernel.cl"))
                {
                    String line = sr.ReadToEnd();
                    source_list.Add(line);
                }
            }
            catch (Exception e)
            {
                Console.WriteLine("The file could not be read:");
                Console.WriteLine(e.Message);
                System.Environment.Exit(1);
            }
            String[] source = source_list.ToArray();
            IntPtr[] lengths = new IntPtr[source.Length];
            for (int i = 0; i < source.Length; ++i) lengths[i] = (IntPtr)source[i].Length;
            ErrorCode e1;
            Context context = Cl.CreateContext("NVIDIA CUDA", DeviceType.Gpu, out e1);
            OpenCL.Net.Program program = Cl.CreateProgramWithSource(context, 1, source, lengths, out e1);
            InfoBuffer ib = Cl.GetContextInfo(context, ContextInfo.Devices, out e1);
            Device[] devices = ib.CastToArray<Device>(ib.Size);
            e1 = Cl.BuildProgram(program, 1, devices, "", null, (IntPtr)0);
            if (e1 != ErrorCode.Success)
            {
                ib = Cl.GetProgramBuildInfo(program, devices[0], ProgramBuildInfo.Log, out e1);
                char[] log = ib.CastToArray<char>(ib.Size);
                System.Console.WriteLine(new String(log));
                System.Environment.Exit(1);
            }
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            for (int i = 0; i < N; i++)
                a[i] = i;
            IMem<int> dev_a = Cl.CreateBuffer<int>(context, MemFlags.ReadWrite, N, out e1);
            Kernel kernel = Cl.CreateKernel(program, "mykernel", out e1);
            e1 = Cl.SetKernelArg(kernel, 0, dev_a);
            e1 = Cl.SetKernelArg(kernel, 1, N);
            CommandQueue command_queue = Cl.CreateCommandQueue(context, devices[0], CommandQueueProperties.None, out e1);
            Event evt;
            e1 = Cl.EnqueueWriteBuffer(command_queue, dev_a, Bool.True, a, 0, null, out evt);
            IntPtr[] tiles;
            IntPtr[] tile_size;
            l2t(grid_size * block_size, 1, out tile_size, out tiles);
            e1 = Cl.EnqueueNDRangeKernel(command_queue, kernel, 1, null, (IntPtr[])tiles, (IntPtr[])tile_size, 0, null, out evt);
            e1 = Cl.EnqueueReadBuffer(command_queue, dev_a, Bool.True, a, 0, null, out evt);
        }

        static int the_blocksize = 256;

        static void l2t(int size, int max_dimensionality, out IntPtr[] tile_size, out IntPtr[] tiles)
        {
            tile_size = new IntPtr[3];
            tiles = new IntPtr[3];            
            for (int j = 0; j < max_dimensionality; ++j)
                tiles[j] = (IntPtr)1;
            int[] max_threads = new int[3]{ the_blocksize, 64, 64};
            int[] max_blocks = new int[3] { 65535, 65535, 65535 };
            for (int j = 0; j < max_dimensionality; ++j)
                tile_size[j] = (IntPtr)1;

            int b = size / (max_threads[0] * max_blocks[0]);
            if (b == 0)
            {
                b = size / max_threads[0];
                if (size % max_threads[0] != 0)
                    b++;

                if (b == 1)
                    max_threads[0] = size;

                // done. return the result.
                tiles[0] = (IntPtr)b;
                tile_size[0] = (IntPtr)max_threads[0];

                // OpenCL uses multiples of tile_size.
                tiles[0] = (IntPtr)((int)tile_size[0] * (int)tiles[0]);
                return;
            }

            int sqrt_size = (int)Math.Sqrt((float)size / max_threads[0]);
            sqrt_size++;

            int b2 = sqrt_size / max_blocks[1];
            if (b2 == 0)
            {
                b = sqrt_size;

                // done. return the result.
                tiles[0] = tiles[1] = (IntPtr)b;
                tile_size[0] = (IntPtr)max_threads[0];

                // OpenCL uses multiples of tile_size.
                tiles[0] = (IntPtr)((int)tile_size[0] * (int)tiles[0]);
                //        tiles[1] *= tile_size[1];
                return;
            }
        }
    }
}


//kernel code
__kernel void mykernel(__global int * a, int N)
{
    int i = get_global_id(0);
	if (i < N)
		a[i] = a[i] * 2;
}

Cloo

An OpenCL API. Note: Although I was able to get the example to work with Cloo, it wasn’t easy. There are no definitive examples or documentation on the project website; familiarity with the standard OpenCL API helped me get something working. I found the API confusing for a couple of reasons: sometimes one uses the “new” operator to create an object (e.g., ComputeBuffer), and other times a factory pattern (e.g., program.CreateKernel); sometimes one accesses a static property (e.g., ComputePlatform.Platforms), and other times calls a method of an instance (e.g., program.GetBuildLog). Compare the examples of Cloo and OpenCL.NET to explore the differences. The API was last updated in 2012.  http://cloo.sourceforge.net/ http://sourceforge.net/projects/cloo/

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Cloo;
using Cloo.Bindings;
using System.IO;
using System.Runtime.InteropServices;

namespace ClooSample
{
    class Program
    {
        static void Main(string[] args)
        {
            List<String> source_list = new List<string>();
            try
            {
                using (StreamReader sr = new StreamReader("kernel.cl"))
                {
                    String line = sr.ReadToEnd();
                    source_list.Add(line);
                }
            }
            catch (Exception e)
            {
                Console.WriteLine("The file could not be read:");
                Console.WriteLine(e.Message);
                System.Environment.Exit(1);
            }
            String[] source = source_list.ToArray();
            ComputePlatform cp = ComputePlatform.Platforms[1];
            ComputeContextPropertyList properties = new ComputeContextPropertyList(cp);
            ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu, properties, null, IntPtr.Zero);
            IReadOnlyCollection<ComputeDevice> devices = context.Devices;
            ComputeDevice[] devs = new ComputeDevice[1];
            devs[0] = devices.First();
            Cloo.ComputeProgram program = new ComputeProgram(context, source);
            try
            {
                program.Build(devs, "", (ComputeProgramBuildNotifier)null, IntPtr.Zero);
            }
            catch
            {
                String error = program.GetBuildLog(devs[0]);
                System.Console.WriteLine(error);
                System.Environment.Exit(1);
            }
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            for (int i = 0; i < N; i++)
                a[i] = i;
            Cloo.ComputeBuffer<int> deva = new ComputeBuffer<int>(context, ComputeMemoryFlags.ReadWrite, a);
            Cloo.ComputeKernel kernel = program.CreateKernel("mykernel");
            kernel.SetMemoryArgument(0, deva);
            kernel.SetValueArgument(1, N);
            Cloo.ComputeCommandQueue command_queue = new ComputeCommandQueue(context, devs[0], ComputeCommandQueueFlags.None);
            command_queue.WriteToBuffer(a, deva, true, null);
            long[] tiles;
            long[] tile_size;
            l2t(grid_size * block_size, 1, out tile_size, out tiles);
            command_queue.Execute(kernel, null, tiles, tile_size, null);
            command_queue.ReadFromBuffer(deva, ref a, true, null);
        }

        static int the_blocksize = 256;

        static void l2t(int size, int max_dimensionality, out long[] tile_size, out long[] tiles)
        {
            tile_size = new long[max_dimensionality];
            tiles = new long[max_dimensionality];
            for (int j = 0; j < max_dimensionality; ++j)
                tiles[j] = (long)1;
            int[] max_threads = new int[3] { the_blocksize, 64, 64 };
            int[] max_blocks = new int[3] { 65535, 65535, 65535 };
            for (int j = 0; j < max_dimensionality; ++j)
                tile_size[j] = (long)1;

            int b = size / (max_threads[0] * max_blocks[0]);
            if (b == 0)
            {
                b = size / max_threads[0];
                if (size % max_threads[0] != 0)
                    b++;

                if (b == 1)
                    max_threads[0] = size;

                // done. return the result.
                tiles[0] = (long)b;
                tile_size[0] = (long)max_threads[0];

                // OpenCL uses multiples of tile_size.
                tiles[0] = (long)((int)tile_size[0] * (int)tiles[0]);
                return;
            }

            int sqrt_size = (int)Math.Sqrt((float)size / max_threads[0]);
            sqrt_size++;

            int b2 = sqrt_size / max_blocks[1];
            if (b2 == 0)
            {
                b = sqrt_size;

                // done. return the result.
                tiles[0] = tiles[1] = (long)b;
                tile_size[0] = (long)max_threads[0];

                // OpenCL uses multiples of tile_size.
                tiles[0] = (long)((int)tile_size[0] * (int)tiles[0]);
                //        tiles[1] *= tile_size[1];
                return;
            }
        }
    }
}


Note: A file containing all these examples in MS Visual Studio 2013 projects is available https://domemtech.box.com/s/a472ynr3duu1vlle2i0dnwjpzclfd0k6.

Updated May 3, 2015

First release of Campy.NET

The first release build of Campy.NET, v0.01.0000, has been uploaded to campynet.codeplex.com. To install, download the file, unzip it, then execute setup.exe. Campy only works on Windows 8 or 10, and you will need to install Visual Studio Express for Windows Desktop. (VS Professional or Ultimate will also work.) You won’t need to set the CAMPYNETROOT environment variable mentioned previously, nor download ILSpy or Uncrustify, since these are now installed by the setup program. This release is built with debugging enabled.

To run the code, create a small “Hello World” program, e.g.:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Campy;
using Campy.Types;

namespace ConsoleApplication10
{
    class Program
    {
        static void Main(string[] args)
        {
            int size = 10;
            int[] data = new int[size];
            for (int i = 0; i < size; ++i) data[i] = 2 * i;
            Extent e = new Extent(size);
            Array_View<int> d = new Array_View<int>(ref data);
            AMP.Parallel_For_Each(d.Extent, (Index idx) =>
            {
                int j = idx[0];
                d[j] = size - j;
            });
            d.Synchronize();
            for (int i = 0; i < size; ++i)
            {
                System.Console.WriteLine(data[i]);
            }
        }
    }
}

Then, in the Solution Explorer, add references to all the DLLs in the Campy.NET installation directory. Finally, build and run the program. Please note: Campy compiles on the fly as the program executes. If you previously built and ran your program and then make changes, Campy will need to rebuild the GPU code: your program will stop and prompt you to restart it so that Campy can finish rebuilding the GPU code.

Any questions or problems, let me know!

Next, I will be working on enhancing control flow analysis so that function calls of any form will be converted into GPU code. The goal is for programmers to have no need to annotate GPU code using attributes or modifiers. The compiler should be able to do that.

Ken

Array, Tiles, Shared Memory, and Generics

New data types have been added to Campy.NET:

  • Array<>, modeled after C++ AMP array<>, is used to represent accelerator memory. Note: only one-dimensional arrays are implemented.
  • Tiled_Index, modeled after C++ AMP tiled_index, is used to represent thread identification, and synchronization.
  • Tile_Static, modeled after C++ AMP tile_static, is used to represent shared memory between threads in a tile.
  • Copy() is used to copy data between the accelerators and CPU.
  • C# generics for GPU code have been implemented.

An example of the use of these types is available here. Two types of reduction are implemented.
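For readers unfamiliar with the pattern, GPU reductions combine elements in a logarithmic number of passes. The following is a sequential CPU sketch of the tree-reduction pattern, not Campy.NET’s implementation; on a GPU, each iteration of the inner loop would be one thread, with a barrier between passes:

```csharp
using System;

class ReductionSketch
{
    // Each pass folds the upper half of the active range into the lower
    // half; after log2(n) passes the sum is in a[0]. The array length is
    // assumed to be a power of two for simplicity.
    public static int Reduce(int[] a)
    {
        for (int stride = a.Length / 2; stride > 0; stride /= 2)
            for (int i = 0; i < stride; ++i)
                a[i] += a[i + stride];
        return a[0];
    }

    static void Main()
    {
        int[] data = new int[8];
        for (int i = 0; i < data.Length; ++i) data[i] = i + 1;
        Console.WriteLine(Reduce(data)); // 1+2+...+8 = 36
    }
}
```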

A milestone accomplished–computing the value of pi

Campy.NET has passed a milestone: it can now approximate the value of pi. The code below is an implementation of the classic toy example of summing a circle’s area.

using Campy;
using Campy.Types;
using System;

namespace Test
{
    class Program
    {
        class Point
        {
            public float x;
            public float y;
        }
        static void Main(string[] args)
        {
            int half_size = 1000;
            int size = half_size * half_size;

            Point[] data = new Point[size];
            for (int i = 0; i < size; ++i) data[i] = new Point();
            Array_View<Point> points = new Array_View<Point>(ref data);
            Extent e = new Extent(size);
            AMP.Parallel_For_Each(e, (Index idx) =>
            {
                int i = idx[0];
                points[i].x = (float)(1.0 * (i / half_size) / half_size);
                points[i].y = (float)(1.0 * (i % half_size) / half_size);
            });
            points.Synchronize();
            int[] insc = new int[size];
            Array_View<int> ins = new Array_View<int>(ref insc);
            //ins.discard_data();
            AMP.Parallel_For_Each(e, (Index idx) =>
            {
                int i = idx[0];
                float radius = 1.0f;
                float tx = points[i].x;
                float ty = points[i].y;
                float t = (float)Math.Sqrt(tx*tx + ty*ty);
                ins[i] = (t <= radius) ? 1 : 0;
            });
            Extent e_half = new Extent(half_size);
            int[] count = new int[1];
            count[0] = 0;
            Array_View<int> res = new Array_View<int>(ref count);
            AMP.Parallel_For_Each(e_half, (Index idx) =>
            {
                int i = idx[0];
                for (int j = 1; j < half_size; ++j)
                {
                    int k = i * half_size;
                    int t1 = ins[k + j];
                    int t2 = ins[k];
                    int t3 = t1 + t2;
                    ins[k] = t3;
                    // cannot decompile!!! ins[i * half_size] += ins[i * half_size + j];
                }
                AMP.Atomic_Fetch_Add(ref res, 0, ins[i * half_size]);
            });
            int cou = res[0];
            System.Console.WriteLine("Count is " + cou + " out of " + size);
            float pi = (4.0f * cou) / size;
            System.Console.WriteLine("Pi is " + pi);
        }
    }
}

Notes:

The example runs only slightly faster than the equivalent CPU implementation, for several reasons. First, an unnecessary memory copy between CPU and GPU occurs at line 21 because the equivalent of discard_data has not yet been implemented. In addition, the last kernel uses an atomic operation to increment the count of points within the circle (line 58). Converting the code to use the Blelloch method of reduction, and removing the unnecessary array “points”, would probably improve the performance by an order of magnitude or two. Alternatively, we could count the number of points outside the circle, then compute pi from the difference with the total number of points; in theory, this would result in less contention because fewer points would be counted. But this is the joy of GPU programming.
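For reference, the computation the kernels above perform reduces on the CPU to a simple loop. A sequential sketch (plain C#, no Campy types):

```csharp
using System;

class PiSketch
{
    // Same grid sampling as the GPU version: half_size x half_size points
    // in the unit square; count those inside the unit circle, then
    // estimate pi as 4 * (points inside) / (total points).
    public static float ComputePi(int half_size)
    {
        int size = half_size * half_size;
        int count = 0;
        for (int i = 0; i < size; ++i)
        {
            float x = (float)(1.0 * (i / half_size) / half_size);
            float y = (float)(1.0 * (i % half_size) / half_size);
            if (Math.Sqrt(x * x + y * y) <= 1.0) count++;
        }
        return (4.0f * count) / size;
    }

    static void Main()
    {
        Console.WriteLine("Pi is " + ComputePi(1000)); // approximately 3.14
    }
}
```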

Also, in the last kernel, the ILSpy decompiler, which Campy.NET uses, cannot correctly decompile the compound assignment shown in the comment; the C++ AMP code it generates does not compile, so the equivalent expanded code is used instead.

Salient Goals of Campy.NET

Campy.NET is designed with a unique set of features for GPGPU programming. These features are intended for programmers wishing to easily express PRAM-like parallel algorithms for the GPU using C#.

Campy.NET supports interleaving of blocks of CPU and GPU code

Campy.NET is intended to support GPU programming using lambda expressions. The primary advantage of this approach is that GPU code can be interleaved with CPU code, which results in clean and easily understandable code.

Let’s start by giving an example: initialize an array of integers with natural numbers.  On a CPU, in C#, this is easily done with just a few lines of code, using a for-loop:

for (int j = 0; j < e; ++j)
{
    a[j] = j;
}

In Campy.NET, the code to perform the initialization on the GPU is similarly expressed:

AMP.Parallel_For_Each(e, (Index i) =>
{
    int j = i[0];
    a[j] = j;
});

This contrasts with CUDA, OpenCL, and wrappers for those languages. In all of these platforms, one is required to place the GPU code in the body of a specially annotated function called a kernel, and to call the kernel in a separate function. In fact, CPU and GPU code often are separated into different files.

Let’s consider what this simple array initialization would look like in CUDA. When reading function “kernel” alone, we do not know what range of elements of the array “a” are assigned; when reading function “host_code” alone, we do not know what “kernel” is performing for a given range. Only when read together does the user understand how host_code and kernel interact.

// GPU code
__global__ void kernel(int * a, int e)
{
    int j = threadIdx.x;
    a[j] = j;
}

...

// CPU code
void host_code()
{
    ...
    kernel<<<g,e>>>(a, e);
    ...
}

This problem is known as the Locality of Definition. Since CPU and GPU code blocks are separated, the readability and understandability of the algorithm suffer.

Campy.NET does not use additional syntax to mark GPU code

Campy.NET does not use language annotations to denote what is CPU code vs. GPU code. Campy.NET determines at runtime what code is GPU code when the user calls AMP.Parallel_For_Each.

In CUDA, OpenCL, OpenACC, C++ AMP, etc., the language (C++ or C) is enhanced with modifiers, pragmas, or attributes, in order to annotate GPU code. All code that is intended to run on the GPU must be annotated.

In the previous example, the function “kernel” is annotated with “__global__” so that the compiler recognizes the function as GPU code. In C++ AMP, GPU code is annotated with “restrict(amp)”.

In OpenACC, loops are annotated with pragmas that tell the compiler which loops to parallelize and which variables to share in the body of the for-loop. For example, the Jacobi iteration from the NVIDIA Parallel Forall blog is shown below (the pragmas here are the OpenMP form from the same post; OpenACC uses analogous #pragma acc directives):

while ( error > tol && iter < iter_max ) {
    error = 0.f;
 
    #pragma omp parallel for shared(m, n, Anew, A)
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                                 + A[j-1][i] + A[j+1][i]);
            error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
        }
    }
 
    #pragma omp parallel for shared(m, n, Anew, A)
    for( int j = 0; j < n-1; j++) {
        for( int i = 0; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];   
        }
    }
 
    if(iter % 100 == 0) printf("%d, %0.6f\n", iter, error);
 
    iter++;
}

Unfortunately, custom pragmas are not available in the C# compiler, although the Roslyn compiler is now open source, so the compiler could in principle be enhanced.

As mentioned above, in Campy.NET, GPU code is simply a lambda expression passed to the Campy.NET function AMP.Parallel_For_Each(). When the function is executed, Campy.NET performs a control flow analysis to determine which functions to convert to C++ AMP, then compiles, links, and calls the result. Lambda expressions capture variables defined prior to the call to AMP.Parallel_For_Each. To avoid re-translating and recompiling a lambda expression on every run, Campy.NET examines the timestamps of the program’s assemblies (DLLs and EXE).

Campy.NET is modeled after C++ AMP, with enhancements

C++ AMP uses compile-time type checking to enforce the semantics of the API. Sometimes the error messages give good information about where to find the error in your program; for example, when a CPU array is used inside a GPU code block:

    int * insc = new int[size];
    array_view<int, 1> ins(size, insc);
    ins.discard_data();
    parallel_for_each(e, [=](index<1> idx) restrict(amp)
    {
        int i = idx[0];
        float radius = 1.0;
        float tx = points[i].x;
        float ty = points[i].y;
        float t = sqrt(tx*tx + ty*ty);
        insc[i] = (t <= radius) ? 1 : 0;
    });

Other times, the C++ compiler gives cryptic error messages which do nothing to identify the cause of the problem, e.g., in this case, using a bool (C++ AMP does not support arrays of bools):

    bool * insc = new bool[size];
    array_view<bool, 1> ins(size, insc);
    ins.discard_data();
    parallel_for_each(e, [=](index<1> idx) restrict(amp)
    {
        int i = idx[0];
        float radius = 1.0;
        float tx = points[i].x;
        float ty = points[i].y;
        float t = sqrt(tx*tx + ty*ty);
        ins[i] = (t <= radius) ? true : false;
    });

In C++ AMP, a programmer can pass a lambda to another function only as a local variable (i.e., “auto”) or as a parameter whose type is a template parameter:

void capture_auto()
{
    ...

    auto IsEven = [=](int t) restrict(amp) -> bool
    {
        return t % 2 == 0;
    };

    parallel_for_each(extent<1>(size), [=](index<1> idx) restrict(amp)
    {
        int i = idx[0];
        a[i] = IsEven(i);
    });

    ...
}

template<typename _K>
void pass_as_parameter(const _K & IsOdd)
{
    ...

    parallel_for_each(extent<1>(size), [=](index<1> idx) restrict(amp)
    {
        int i = idx[0];
        a[i] = IsOdd(i);
    });

	...
}

void test2()
{
    auto test = [=](int t) restrict(amp) -> bool
    {
        return t % 2 == 1;
    };
    pass_as_parameter(test);
}

In C++ AMP, the tile sizes for tiled_extent/tiled_index must be integer constants defined at compile time.

C++ AMP cannot use the C++11 standard “std::function<>” for lambda type definitions, because C++ AMP requires the “restrict(amp)” modifier, which std::function<> does not support.

Campy.NET avoids many of these problems. It provides additional type checking to warn users of problems. Campy.NET allows one to define the type of the lambda expression and use that when passing functions as parameters. The Campy.NET types for tile sizes can be variable.

Campy.NET is an API, not a new language

Campy.NET is an API implemented in the C# language for parallel programming; it does not alter the compiler or runtime. Campy.NET is essentially a C# wrapper for C++ AMP.

Global memory management is seamless

Like C++ AMP, Campy.NET uses a container to wrap the access of a data structure in CPU memory. When accessed in GPU code, the data is copied to GPU memory automatically. After the Parallel_For_Each completes, the data is copied back to CPU memory only if the wrapper is accessed in CPU code.
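The copy semantics described above can be illustrated with a CPU-only mock; this is a sketch of the behavior (copy to the device on kernel launch, copy back lazily on the first CPU-side access), not Campy.NET’s actual implementation:

```csharp
using System;

// Mock wrapper illustrating lazy copy-back semantics. Data is copied to
// "device" storage when a kernel runs, and copied back to CPU memory
// only when the wrapper is next accessed from CPU code.
class MockArrayView
{
    private readonly int[] cpu;
    private int[] device;
    private bool dirtyOnDevice;

    public MockArrayView(int[] data) { cpu = data; }

    public void RunKernel(Action<int[]> kernel)
    {
        device = (int[])cpu.Clone();   // copy CPU -> GPU
        kernel(device);
        dirtyOnDevice = true;          // CPU copy is now stale
    }

    public int this[int i]
    {
        get
        {
            if (dirtyOnDevice)         // copy GPU -> CPU only on access
            {
                Array.Copy(device, cpu, cpu.Length);
                dirtyOnDevice = false;
            }
            return cpu[i];
        }
    }

    static void Main()
    {
        var view = new MockArrayView(new int[4]);
        view.RunKernel(d => { for (int j = 0; j < d.Length; ++j) d[j] = j * 2; });
        Console.WriteLine(view[3]); // 6
    }
}
```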

Support for shared memory and thread synchronization

Campy.NET supports shared memory and thread synchronization for efficient sub-division of large problems.

The following is an example from the NVIDIA Parallel ForAll column (http://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/), converted into Campy.NET.

using System;
using Campy;
using Campy.Types;

namespace Reverse
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create an array of numbers.
            int size = 64;
            int[] data = new int[size];
            for (int i = 0; i < size; ++i) data[i] = i;
            Array_View<int> d = new Array_View<int>(ref data);

            // Initialize...
            Extent e = new Extent(size);
            AMP.Parallel_For_Each(d.Extent, (Index idx) =>
            {
                int j = idx[0];
                d[j] = j;
            });

            // Reverse...
            Tile_Static<int> s = new Tile_Static<int>(64);
            AMP.Parallel_For_Each(d.Extent.Tile(size), (Tiled_Index idx) =>
            {
                int t = idx.Local[0];
                int tr = size - t - 1;
                s[t] = d[t];
                idx.Barrier.Wait();
                d[t] = s[tr];
            });
            for (int i = 0; i < size; ++i)
                System.Console.WriteLine(d[i]);
        }
    }
}

Support for atomic operations

Campy.NET provides atomic addition, subtraction, fetch, and other operations for thread-synchronized access to memory. An example is the classic “sum of sixes” problem.

using System;
using Campy;
using Campy.Types;

namespace SumOfSixes
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a large array of numbers, some containing sixes.
            int size = 50000000;
            int[] data = new int[size];
            Array_View<int> d = new Array_View<int>(ref data);

            // Initialize...
            Extent e = new Extent(size);
            AMP.Parallel_For_Each(d.Extent, (Index idx) =>
            {
                int j = idx[0];
                d[j] = (j % 30) == 0 ? 6 : j;
            });

            // Count sixes...
            int[] result = new int[1];
            Array_View<int> r = new Array_View<int>(ref result);
            AMP.Parallel_For_Each(d.Extent, (Index idx) =>
            {
                int j = idx[0];
                if (d[j] == 6)
                    AMP.Atomic_Fetch_Add(ref r, 0, 1);
            });
            System.Console.WriteLine("result = " + r[0]);
        }
    }
}

Support for multi-core CPUs or GPUs

Campy.NET supports the execution of parallel code on any multi-core processor, as long as it is supported in C++ AMP. A programmer can easily select the processor to perform the parallel execution of the lambda functions.

Status

That said, the overall design and implementation of the Campy.NET API are not yet settled. While there is an effort to model it after C++ AMP, the API must be redesigned to avoid the pitfalls of C++ AMP, such as its inability to handle pointers. See http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/19/restrict-amp-restrictions-part-0-of-n-introduction.aspx . Currently, the implementation also relies heavily on the Visual Studio C++ compiler, which it should not. If you are interested in helping out, please let me (Ken Domino) know.

(Note: updated Feb 7, 2015 for API changes.)