Campy

CLI-based APIs for GPGPU Computing

Posted on March 15, 2015May 3, 2015 by kaby76

How does Campy compare to other CLI-based (e.g., C#, F#) APIs for GPGPU computing? The following is a partial list of what is currently available along with an example, which computes the even natural numbers.

Cudafy

Cudafy is an API targeting CUDA and OpenCL Programs and kernels are written in C# or other CIL languages. Cudafy scans the CIL, converts GPU code (labeled with the “[Cudafy]” attribute) into CUDA C++ or OpenCL, compiles the code into PTX, then run using CUDA.NET. Cudafy also uses Cloo (see below for more information on Cloo).

Note: This API is very good, and maintained fairly often. It is easy to use, and well documented.

Author: Nicholas Kopp; Download: http://cudafy.codeplex.com/ http://www.hybriddsp.com/Products/CUDAfyNET.aspx; Last updated Jan 2015; Started ~Jun 2011; Open source, GNU Library General Public License (LGPL).

Example:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

namespace CudafyExample
{
    class Program
    {
        static void Main(string[] args)
        {
            CudafyModes.Target = eGPUType.Cuda;
            CudafyModes.DeviceId = 0;
            CudafyTranslator.Language = CudafyModes.Target == eGPUType.OpenCL ? eLanguage.OpenCL : eLanguage.Cuda;
            GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, 0);
            eArchitecture arch = gpu.GetArchitecture();
            CudafyModule km = CudafyTranslator.Cudafy(arch);
            gpu.LoadModule(km);
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            int[] dev_a = gpu.Allocate<int>(a);
            for (int i = 0; i < N; i++)
                a[i] = i;
            gpu.CopyToDevice(a, dev_a);
            gpu.Launch(grid_size, block_size, fun, dev_a, N);
            gpu.CopyFromDevice(dev_a, a);
            gpu.Free(dev_a);
        }

        [Cudafy]
        public static void fun(GThread thread, int[] a, int N)
        {
            int tid = thread.blockIdx.x;
            if (tid < N)
            {
                a[tid] = a[tid] * 2;
            }
        }
    }
}

ManagedCUDA

ManagedCUDA is an API for the CUDA Driver API. The user writes kernels in CUDA C++, then compiles them into PTX via the CUDA compiler. ManagedCUDA executes the PTX kernels. Copying data to/from the GPU is achieved via an assignment, which is a nice simplification in syntax.

Note: This API is also well maintained, and easy to use. The main issue is that kernels must be written in CUDA C++, then compiled into PTX. If you’re looking for a way of writing kernels in C#, this solution isn’t for you. But, it is an excellent, straight forward API to the CUDA Driver API.

Author: kunzmi (Michael Kunz); Download: http://managedcuda.codeplex.com/; Last updated Feb 2015; Started ~ Jan 2011; Open source, GNU Library General Public License (LGPL).

Example:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using ManagedCuda;
using ManagedCuda.BasicTypes;
using ManagedCuda.VectorTypes;
using System.IO;
using System.Reflection;

namespace ManagedCudaExample
{
    class Program
    {
        static void Main(string[] args)
        {
            CudaContext ctx = new CudaContext(CudaContext.GetMaxGflopsDeviceId());
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            CudaDeviceVariable<int> dev_a = new CudaDeviceVariable<int>(N);
            for (int i = 0; i < N; i++)
                a[i] = i;
            dev_a = a;
            string resName = "kernel.ptx";
            string resNamespace = "ManagedCudaExample";
            string resource = resNamespace + "." + resName;
            Stream stream = Assembly.GetExecutingAssembly().GetManifestResourceStream(resource);
            if (stream == null) throw new ArgumentException("Kernel not found in resources.");
            CudaKernel kernel = ctx.LoadKernelPTX(stream, "kernel");
            int threadsPerBlock = block_size;
            kernel.BlockDimensions = block_size;
            kernel.GridDimensions = grid_size;
            kernel.Run(dev_a.DevicePointer, N);
            a = dev_a;
        }
    }
}

ManagedCUDA requires kernels built in PTX. The following is the kernel in CUDA C++, which can be compiled into PTX via the CUDA compiler with the –keep option.

//Includes for IntelliSense 
#include <cuda.h>
 
extern "C"  {
    //kernel code
	__global__ void kernel(int * a, int N)
	{
		int i = blockDim.x * blockIdx.x + threadIdx.x;
		if (i < N)
			a[i] = a[i] * 2;
	}
}

CUDA.NET

CUDA.NET is an API for the CUDA Driver API. Users write kernels in CUDA C++, compile them into PTX, then execute the PTX with CUDA.NET.

Download: http://www.cass-hpc.com/solutions/libraries/cuda-net; Last updated 2010; Started ~Apr 2008.

Notes: Supposedly, the code for CUDA.NET is open source, and has no license requirements. However, the archive page does not have a download link for source, just the binaries. I found the source in the Cudafy source tree. The latest build is for CUDA 3.0, which is very old (currently, NVIDIA is on CUDA 7.0). As I mentioned, the only source I could find is under Cudafy (Cudafy/3p/cuda.net3.0.0_win/src/…), which has been updated for Cudafy and CUDA 6.5. The documentation in the latest download, v3.0, is unreadable. The only other documentation I could find is for v2.0, http://www.hoopoe-cloud.com/files/cuda.net/2.0/CUDA.NET_2.0.pdf, but contains numerous inaccuracies.

Example:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using GASS.CUDA;
using GASS.CUDA.Types;
using GASS.Types;

namespace CudaNetExample
{
    class Program
    {
        static void Main(string[] args)
        {
            CUResult i0 = CUDADriver.cuInit(0);
            CUdevice dev = new CUdevice();
            CUResult i1 = CUDADriver.cuDeviceGet(ref dev, 0);
            CUcontext ctx = new CUcontext();
            CUResult i2 = CUDADriver.cuCtxCreate(ref ctx, 0, dev);

            CUdeviceptr p1 = new CUdeviceptr();
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            for (int i = 0; i < N; i++)
                a[i] = i;
            CUResult s1 = CUDADriver.cuMemAlloc(ref p1, N * sizeof(int));
            CUResult s2 = CUDADriver.cuMemcpyHtoD(p1, a, a.Length * sizeof(int));
            CUmodule mod = new CUmodule();
            CUResult s3 = CUDADriver.cuModuleLoad(ref mod, @"C:\Users\Ken\Documents\Visual Studio 2013\Projects\ManagedCudaExample\Win32Project1\Debug\kernel.ptx");
            CUfunction func = new CUfunction();
            CUResult s4 = CUDADriver.cuModuleGetFunction(ref func, mod, "kernel");

            CUResult r1 = CUDADriver.cuParamSeti(func, 0, (uint)p1.Pointer);
            CUResult r2 = CUDADriver.cuParamSeti(func, 4, (uint)N);
            CUResult r3 = CUDADriver.cuParamSetSize(func, 8);
            CUResult r4 = CUDADriver.cuFuncSetBlockShape(func, block_size, 1, 1);
            CUResult r5 = CUDADriver.cuLaunchGrid(func, grid_size, 1);
            CUResult r6 = CUDADriver.cuCtxSynchronize();

            CUResult f1 = CUDADriver.cuMemcpyDtoH(a, p1, a.Length * sizeof(int));
 
        }
    }
}

Alea GPU

Alea GPU is an API similar to Cudafy. Kernels are annotated with the “[AOTCompile]” attribute, which are then compiled for the GPU via LLVM.

Author: Daniel Egloff; Download: http://quantalea.com/home; http://quantalea.com/static/app/tutorial/quick_start/quick_start_example.html; Last updated Mar 2015. Proprietary.

Example:

using System;
using Alea.CUDA;
using Alea.CUDA.Utilities;
using Alea.CUDA.IL;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace AleaExample
{
    class Program
    {
        [AOTCompile]
        static void kernel(deviceptr<int> a, int n)
        {
            var i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)
                a[i] = a[i] * 2;
        }

        static void Main(string[] args)
        {
            var inputs = Enumerable.Range(0, 1000000).Select(i => i).ToArray();
            var worker = Worker.Default;
            DeviceMemory<int> dInputs = worker.Malloc(inputs);
            const int blockSize = 256;
            var numSm = worker.Device.Attributes.MULTIPROCESSOR_COUNT;
            var gridSize = Math.Min(16 * numSm, Common.divup(inputs.Length, blockSize));
            var lp = new LaunchParam(gridSize, blockSize);
            worker.Launch<deviceptr<int>, int>(kernel, lp, dInputs.Ptr, inputs.Length);
        }
    }
}

Note: I could not get the example to compile and run. The Alea GPU compiler requires a license, which I could not get because the email confirmation link produced an error on the company’s server.

CudaSharp

In CudaSharp, users write kernels in C#. The GPU code is converted into PTX via LLVM, then executed via the ManagedCUDA API. Kernels can be lambdas.

Download: https://github.com/khyperia/CudaSharp; Last updated Jan 2014; Started Jan 2014; Open source, no license. Author: khyperia – Evan Huack

Note: This project is incomplete. The source code for the API does not build. Further, while in principle the code translates MS CIL, which is a stack-based assembly language, into LLVM, an SSA-based representation, it may not correct because there is no code that performs the aliasing of the stacks from multiple predecessors.

A similar project was written for a Masters Thesis by Soren Horup, Soren Juul, and Herik Larsen at Aalborg University in June 2011. The thesis is http://projekter.aau.dk/projekter/files/52819053/Report.pdf. However, I could not find the source code for the project.

GPU.NET

UK based company TidePowerd.com, started by Nicolas Beecroft and Jack Pappas, produced a CIL-based backend targeting CUDA GPUs. The website is no longer available, so the details aren’t clear. But, there are examples in https://github.com/tidepowerd/GPU.NET-Example-Projects, which indicate kernel markup with “[kernel]” attributes, and a high-level model of the GPU based upon CUDA C++.

GpuLinq

Written in F#, this LINQ-like API contains a compiler that runs queries on the GPU. https://github.com/nessos/GpuLinq/ The current status of the project is unknown.

Brahma

A project of the Software Engineering Research Group at Saint-Petersburg State University of the Russian Federation, the code supposedly was an API that targeted OpenCL. However, it’s no longer available. There is a similar project Brahma.FSharp, which seems to be active (last updated 2014), but the details are unclear. https://sites.google.com/site/semathsrprojects/home

Microsoft Accelerator

This project offered a LINQ-like API contains a compiler that runs queries on the GPU. The status of the project is unclear. http://research.microsoft.com/en-us/projects/Accelerator/

OpenCL.NET

OpenCL is an API for multicore computing, whether it be on a CPU or GPU. The API is platform and device neutral. OpenCL.NET is an API which is a thin wrapper on top of the OpenCL.dll assembly. OpenCL requires one to write the kernels in C. These are compiled, linked, loaded and run by the OpenCL drivers. Note: OpenCL.NET has not be updated since Sep 2013. https://openclnet.codeplex.com/

Example:

using System;
using System.Diagnostics;
using System.Linq;
using OpenCL.Net.Extensions;
using OpenCL.Net;
using System.IO;
using System.Collections.Generic;

namespace Opencl_Device_Query
{
    class Program
    {
        static void Main(string[] args)
        {
            List<String> source_list = new List<string>();            
            try
            {
                using (StreamReader sr = new StreamReader("kernel.cl"))
                {
                    String line = sr.ReadToEnd();
                    source_list.Add(line);
                }
            }
            catch (Exception e)
            {
                Console.WriteLine("The file could not be read:");
                Console.WriteLine(e.Message);
                System.Environment.Exit(1);
            }
            String[] source = source_list.ToArray();
            IntPtr[] lengths = new IntPtr[source.Length];
            for (int i = 0; i < source.Length; ++i) lengths[i] = (IntPtr)source[i].Length;
            ErrorCode e1;
            Context context = Cl.CreateContext("NVIDIA CUDA", DeviceType.Gpu, out e1);
            OpenCL.Net.Program program = Cl.CreateProgramWithSource(context, 1, source, lengths, out e1);
            InfoBuffer ib = Cl.GetContextInfo(context, ContextInfo.Devices, out e1);
            Device[] devices = ib.CastToArray<Device>(ib.Size);
            e1 = Cl.BuildProgram(program, 1, devices, "", null, (IntPtr)0);
            if (e1 != ErrorCode.Success)
            {
                ib = Cl.GetProgramBuildInfo(program, devices[0], ProgramBuildInfo.Log, out e1);
                char[] log = ib.CastToArray<char>(ib.Size);
                System.Console.WriteLine(new String(log));
                System.Environment.Exit(1);
            }
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            for (int i = 0; i < N; i++)
                a[i] = i;
            IMem<int> dev_a = Cl.CreateBuffer<int>(context, MemFlags.ReadWrite, N, out e1);
            Kernel kernel = Cl.CreateKernel(program, "mykernel", out e1);
            e1 = Cl.SetKernelArg(kernel, 0, dev_a);
            e1 = Cl.SetKernelArg(kernel, 1, N);
            CommandQueue command_queue = Cl.CreateCommandQueue(context, devices[0], CommandQueueProperties.None, out e1);
            Event evt;
            e1 = Cl.EnqueueWriteBuffer(command_queue, dev_a, Bool.True, a, 0, null, out evt);
            IntPtr[] tiles;
            IntPtr[] tile_size;
            l2t(grid_size * block_size, 1, out tile_size, out tiles);
            e1 = Cl.EnqueueNDRangeKernel(command_queue, kernel, 1, null, (IntPtr[])tiles, (IntPtr[])tile_size, 0, null, out evt);
            e1 = Cl.EnqueueReadBuffer(command_queue, dev_a, Bool.True, a, 0, null, out evt);
        }

        static int the_blocksize = 256;

        static void l2t(int size, int max_dimensionality, out IntPtr[] tile_size, out IntPtr[] tiles)
        {
            tile_size = new IntPtr[3];
            tiles = new IntPtr[3];            
            for (int j = 0; j < max_dimensionality; ++j)
                tiles[j] = (IntPtr)1;
            int[] max_threads = new int[3]{ the_blocksize, 64, 64};
            int[] max_blocks = new int[3] { 65535, 65535, 65535 };
            for (int j = 0; j < max_dimensionality; ++j)
                tile_size[j] = (IntPtr)1;

            int b = size / (max_threads[0] * max_blocks[0]);
            if (b == 0)
            {
                b = size / max_threads[0];
                if (size % max_threads[0] != 0)
                    b++;

                if (b == 1)
                    max_threads[0] = size;

                // done. return the result.
                tiles[0] = (IntPtr)b;
                tile_size[0] = (IntPtr)max_threads[0];

                // OpenCL uses multiples of tile_size.
                tiles[0] = (IntPtr)((int)tile_size[0] * (int)tiles[0]);
                return;
            }

            int sqrt_size = (int)Math.Sqrt((float)size / max_threads[0]);
            sqrt_size++;

            int b2 = sqrt_size / max_blocks[1];
            if (b2 == 0)
            {
                b = sqrt_size;

                // done. return the result.
                tiles[0] = tiles[1] = (IntPtr)b;
                tile_size[0] = (IntPtr)max_threads[0];

                // OpenCL uses multiples of tile_size.
                tiles[0] = (IntPtr)((int)tile_size[0] * (int)tiles[0]);
                //        tiles[1] *= tile_size[1];
                return;
            }
        }
    }
}

//kernel code
__kernel void mykernel(__global int * a, int N)
{
    int i = get_global_id(0);
	if (i < N)
		a[i] = a[i] * 2;
}

Cloo

An OpenCL API. Note: Although I was able to get the example to work with Cloo, it wasn’t easy. There are no definitive examples nor documentation on the project website. Familiarity with the standard OpenCL API helped me to get something working. I found the API confusing for a couple reasons: sometimes one would use the “new” operator to create an object (e.g., ComputeBuffer), and other times a factory pattern (e.g., program.CreateKernel); sometimes one would access a static (e.g., ComputePlatform.Platforms), then alternatively call a method of an instance (e.g., program.GetBuildLog). Compare the examples of Cloo and OpenCL.NET to explore the differences. API was last updated in 2012. http://cloo.sourceforge.net/ http://sourceforge.net/projects/cloo/

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Cloo;
using Cloo.Bindings;
using System.IO;
using System.Runtime.InteropServices;

namespace ClooSample
{
    class Program
    {
        static void Main(string[] args)
        {
            List<String> source_list = new List<string>();
            try
            {
                using (StreamReader sr = new StreamReader("kernel.cl"))
                {
                    String line = sr.ReadToEnd();
                    source_list.Add(line);
                }
            }
            catch (Exception e)
            {
                Console.WriteLine("The file could not be read:");
                Console.WriteLine(e.Message);
                System.Environment.Exit(1);
            }
            String[] source = source_list.ToArray();
            ComputePlatform cp = ComputePlatform.Platforms[1];
            ComputeContextPropertyList properties = new ComputeContextPropertyList(cp);
            ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu, properties, null, IntPtr.Zero);
            IReadOnlyCollection<ComputeDevice> devices = context.Devices;
            ComputeDevice[] devs = new ComputeDevice[1];
            devs[0] = devices.First();
            Cloo.ComputeProgram program = new ComputeProgram(context, source);
            try
            {
                program.Build(devs, "", (ComputeProgramBuildNotifier)null, IntPtr.Zero);
            }
            catch
            {
                String error = program.GetBuildLog(devs[0]);
                System.Console.WriteLine(error);
                System.Environment.Exit(1);
            }
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            for (int i = 0; i < N; i++)
                a[i] = i;
            Cloo.ComputeBuffer<int> deva = new ComputeBuffer<int>(context, ComputeMemoryFlags.ReadWrite, a);
            Cloo.ComputeKernel kernel = program.CreateKernel("mykernel");
            kernel.SetMemoryArgument(0, deva);
            kernel.SetValueArgument(1, N);
            Cloo.ComputeCommandQueue command_queue = new ComputeCommandQueue(context, devs[0], ComputeCommandQueueFlags.None);
            command_queue.WriteToBuffer(a, deva, true, null);
            long[] tiles;
            long[] tile_size;
            l2t(grid_size * block_size, 1, out tile_size, out tiles);
            command_queue.Execute(kernel, null, tiles, tile_size, null);
            command_queue.ReadFromBuffer(deva, ref a, true, null);
        }

        static int the_blocksize = 256;

        static void l2t(int size, int max_dimensionality, out long[] tile_size, out long[] tiles)
        {
            tile_size = new long[max_dimensionality];
            tiles = new long[max_dimensionality];
            for (int j = 0; j < max_dimensionality; ++j)
                tiles[j] = (long)1;
            int[] max_threads = new int[3] { the_blocksize, 64, 64 };
            int[] max_blocks = new int[3] { 65535, 65535, 65535 };
            for (int j = 0; j < max_dimensionality; ++j)
                tile_size[j] = (long)1;

            int b = size / (max_threads[0] * max_blocks[0]);
            if (b == 0)
            {
                b = size / max_threads[0];
                if (size % max_threads[0] != 0)
                    b++;

                if (b == 1)
                    max_threads[0] = size;

                // done. return the result.
                tiles[0] = (long)b;
                tile_size[0] = (long)max_threads[0];

                // OpenCL uses multiples of tile_size.
                tiles[0] = (long)((int)tile_size[0] * (int)tiles[0]);
                return;
            }

            int sqrt_size = (int)Math.Sqrt((float)size / max_threads[0]);
            sqrt_size++;

            int b2 = sqrt_size / max_blocks[1];
            if (b2 == 0)
            {
                b = sqrt_size;

                // done. return the result.
                tiles[0] = tiles[1] = (long)b;
                tile_size[0] = (long)max_threads[0];

                // OpenCL uses multiples of tile_size.
                tiles[0] = (long)((int)tile_size[0] * (int)tiles[0]);
                //        tiles[1] *= tile_size[1];
                return;
            }
        }
    }
}

Note: A file containing all these examples in MS Visual Studio 2013 projects is available https://domemtech.box.com/s/a472ynr3duu1vlle2i0dnwjpzclfd0k6.

Updated May 3, 2015

First release of Campy.NET

Posted on February 13, 2015February 14, 2015 by kaby76

The first release build of Campy.NET, v0.01.0000, has been uploaded to campynet.codeplex.com. To install, download the file, unzip, then execute setup.exe. Campy only works on Windows 8 or 10, and you will need to install Visual Studio Express for Windows Desktop. (Note, VS Professional or Ultimate will also work.) You won’t need to set the CAMPYNETROOT environmental variable mentioned previously, nor download ILSpy or Uncrustify, since these are now installed by the setup program. This release is built with debugging.

To run the code, create a small “Hello World” program, e.g.:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Campy;
using Campy.Types;

namespace ConsoleApplication10
{
    class Program
    {
        static void Main(string[] args)
        {
            int size = 10;
            int[] data = new int[size];
            for (int i = 0; i < size; ++i) data[i] = 2 * i;
            Extent e = new Extent(size);
            Array_View<int> d = new Array_View<int>(ref data);
            AMP.Parallel_For_Each(d.Extent, (Index idx) =>
            {
                int j = idx[0];
                d[j] = size - j;
            });
            d.Synchronize();
            for (int i = 0; i < size; ++i)
            {
                System.Console.WriteLine(data[i]);
            }
        }
    }
}

Then, in the Solution Explorer, add references to all the DLLs in the Campy.NET installation directory. Finally, build and run the program. Please note: Campy compiles on the fly as the program executes. If you previously built and ran your program, then make changes, Campy will need to rebuild the GPU code. Your program will stop, and prompt you to restart your program in order for Campy to finish rebuilding the GPU code.

Any questions or problems, let me know!

Next, I will be working on enhancing control flow analysis so that function calls of any form will be converted into GPU code. The goal is for programmers to have no need to annotate GPU code using attributes or modifiers. The compiler should be able to do that.

Ken

Array, Tiles, Shared Memory, and Generics

Posted on February 13, 2015February 14, 2015 by kaby76

New data types have been added to Campy.NET:

Array<>, modeled after C++ AMP array<>, is used to represent accelerator memory. Note: only one-dimensional arrays are implemented.
Tiled_Index, modeled after C++ AMP tiled_index, is used to represent thread identification, and synchronization.
Tile_Static, modeled after C++ AMP tile_static, is used to represent shared memory between threads in a tile.
Copy() is used to copy data between the accelerators and CPU.
C# Generics for GPU code has been implemented.

An example of the use of these types is available in here. Two types of reduction are implemented.

A milestone accomplished–computing the value of pi

Posted on February 6, 2015February 6, 2015 by kaby76

Campy.NET has passed a milestone: it can now approximate the value of pi. The code, below, is an implementation of the classic toy example “summing a circle’s area”.

using Campy;
using Campy.Types;
using System;

namespace Test
{
    class Program
    {
        class Point
        {
            public float x;
            public float y;
        }
        static void Main(string[] args)
        {
            int half_size = 1000;
            int size = half_size * half_size;

            Point[] data = new Point[size];
            for (int i = 0; i < size; ++i) data[i] = new Point();
            Array_View<Point> points = new Array_View<Point>(ref data);
            Extent e = new Extent(size);
            AMP.Parallel_For_Each(e, (Index idx) =>
            {
                int i = idx[0];
                points[i].x = (float)(1.0 * (i / half_size) / half_size);
                points[i].y = (float)(1.0 * (i % half_size) / half_size);
            });
            points.Synchronize();
            int[] insc = new int[size];
            Array_View<int> ins = new Array_View<int>(ref insc);
            //ins.discard_data();
            AMP.Parallel_For_Each(e, (Index idx) =>
            {
                int i = idx[0];
                float radius = 1.0f;
                float tx = points[i].x;
                float ty = points[i].y;
                float t = (float)Math.Sqrt(tx*tx + ty*ty);
                ins[i] = (t <= radius) ? 1 : 0;
            });
            Extent e_half = new Extent(half_size);
            int[] count = new int[1];
            count[0] = 0;
            Array_View<int> res = new Array_View<int>(ref count);
            AMP.Parallel_For_Each(e_half, (Index idx) =>
            {
                int i = idx[0];
                for (int j = 1; j < half_size; ++j)
                {
                    int k = i * half_size;
                    int t1 = ins[k + j];
                    int t2 = ins[k];
                    int t3 = t1 + t2;
                    ins[k] = t3;
                    // cannot decompile!!! ins[i * half_size] += ins[i * half_size + j];
                }
                AMP.Atomic_Fetch_Add(ref res, 0, ins[i * half_size]);
            });
            int cou = res[0];
            System.Console.WriteLine("Count is " + cou + " out of " + size);
            float pi = (4.0f * cou) / size;
            System.Console.WriteLine("Pi is " + pi);
        }
    }
}

Notes:

Although the example runs only slightly faster than the CPU equivalent implementation, there are several problems. First, unnecessary memory copy at line 21 occurs between CPU and GPU because the equivalent to discard_data has not yet been implemented. In addition, the last kernel uses an atomic operation to increment the count of the points within the circle (line 58). Converting the code to use the Blelloch method of reduction, and removing the unnecessary array “points” would improve the performance probably an order of magnitude or two. Alternatively, we could count the number of point outside the circle, then compute pi using the difference with the total number of points. In theory, this would result in less contention because there would be fewer points counted. But, this is the joy of GPU programming.

Also, in the last kernel, the ILSpy decompiler, which Campy.NET uses, has a notable problem with the code. The generated C++ AMP code cannot compile. The equivalent code is provided.

Salient Goals of Campy.NET

Posted on February 2, 2015February 7, 2015 by kaby76

Campy.NET is designed with a unique set of features for GPGPU programming. These features are intended for programmers wishing to easily express PRAM-like parallel algorithms for the GPU using C#.

Campy.NET supports interleaving of blocks of CPU and GPU code

Campy.NET is intended to support GPU program using lambda expressions. The primary advantage of this approach is that GPU code can be interleaved with CPU code, which results in clean and easily understandable code.

Let’s start by giving an example: initialize an array of integers with natural numbers. On a CPU, in C#, this is easily done with just a few lines of code, using a for-loop:

for (int j = 0; j < e; ++j)
{
    a[j] = j;
}

In Campy.NET, the code to perform the initialization on the GPU is similarly expressed:

AMP.Parallel_For_Each(e, (Index i) =>
{
    int j = i[0];
    a[j] = j;
});

This contrasts with CUDA, OpenCL, and wrappers for those languages. In all of these platforms, one is required to place the GPU code in the body of a specially annotated function called a kernel, and to call the kernel in a separate function. In fact, CPU and GPU code often are separated into different files.

Let’s consider what this simple array initialization would look like in CUDA. When reading function “kernel” alone, we do not know what range of elements of the array “a” are assigned; when reading function “host_code” alone, we do not know what “kernel” is performing for a given range. Only when read together does the user understand how host_code and kernel interact.

// GPU code
__global__ void kernel(int * a, int e)
{
    int j = threadIdx.x;
    a[j] = j;
}

...

// CPU code
void host_code()
{
    ...
    kernel<<<g,e>>>(a, e);
    ...
}

This problem is known as the Locality of Definition. Since CPU and GPU code blocks are separated, the readability and understandability of the algorithm suffers.

Campy.NET does not use additional syntax to mark GPU code

Campy.NET does not use language annotations to denote what is CPU code vs. GPU code. Campy.NET determines at runtime what code is GPU code when the user calls AMP.Parallel_For_Each.

In CUDA, OpenCL, OpenACC, C++ AMP, etc., the language (C++ or C) is enhanced with modifiers, pragmas, or attributes, in order to annotate GPU code. All code that is intended to run on the GPU must be annotated.

In the previous example, the function “kernel” is annotated with “__global__” to the compiler recognizes that the function is GPU code. In C++ AMP, GPU code is annotated with “restrict(amp)”.

In OpenACC, loops are annotated with pragmas to provide information to the compiler GPU code about which loop to parallelize, and which variables to capture in the body of the for-loop. For example, from the NVIDIA Parallel ForEach blog, the Jacobi iteration is shown:

while ( error > tol && iter < iter_max ) {
    error = 0.f;
 
    #pragma omp parallel for shared(m, n, Anew, A)
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                                 + A[j-1][i] + A[j+1][i]);
            error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
        }
    }
 
    #pragma omp parallel for shared(m, n, Anew, A)
    for( int j = 0; j < n-1; j++) {
        for( int i = 0; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];   
        }
    }
 
    if(iter % 100 == 0) printf("%d, %0.6fn", iter, error);
 
    iter++;
}

Unfortunately, custom pragmas are not available in the C# compiler, although the Roslyn compiler is now open source, and the compiler could be enhanced.

As mentioned above, in Campy.NET, GPU code is simply a lambda expression passed to the Campy.NET function AMP.Parallel_For_Each(). When the function is executed, Campy.NET performs a control flow analysis to determine what functions to convert to C++ AMP, then compiles, links, and calls the function. Lambda expressions capture variables defined prior to the call to AMP.Parallel_For_Each. In order to prevent the re-translation and compilation of the lambda expression, Campy.NET examines the timestamp of the program on the assemblies (DLL and EXE) for the program.

Campy.NET is modeled after C++ AMP, with enhancements

C++ AMP uses compile-time type checking to enforce the semantics of the API. Sometimes, the error messages yield good information of where to find the error in your program. For example, when using a CPU array inside a GPU code block.

    int * insc = new int[size];
	array_view<int, 1> ins(size, insc);
	ins.discard_data();
	parallel_for_each(e, [=](index<1> idx) restrict(amp)
	{
		int i = idx[0];
		float radius = 1.0;
		float tx = points[i].x;
		float ty = points[i].y;
		float t = sqrt(tx*tx + ty*ty);
		insc[i] = (t <= radius) ? 1 : 0;
	});

Other times, the C++ compiler gives cryptic error messages which do nothing to identify the cause of the problem, e.g., in this case, using a bool (C++ AMP does not support arrays of bools):

    bool * insc = new bool[size];
	array_view<bool, 1> ins(size, insc);
	ins.discard_data();
	parallel_for_each(e, [=](index<1> idx) restrict(amp)
	{
		int i = idx[0];
		float radius = 1.0;
		float tx = points[i].x;
		float ty = points[i].y;
		float t = sqrt(tx*tx + ty*ty);
		ins[i] = (t <= radius) ? true : false;
	});

In C++ AMP, a programmer can only pass a lambda to another either as an local variable (i.e., “auto”), or as a parameter whose type is a template parameter:

void capture_auto()
{
    ...

    auto IsEven = [=](int t) restrict(amp) -> bool
    {
        return t % 2 == 0;
    };

    parallel_for_each(extent<1>(size), [=](index<1> idx) restrict(amp)
    {
        int i = idx[0];
        a[i] = IsEven(i);
    });

    ...
}

template<typename _K>
void pass_as_parameter(const _K & IsOdd)
{
    ...

    parallel_for_each(extent<1>(size), [=](index<1> idx) restrict(amp)
    {
        int i = idx[0];
        a[i] = IsOdd(i);
    });

	...
}

void test2()
{
    auto test = [=](int t) restrict(amp) -> bool
    {
        return t % 2 = 1;
    };
    pass_as_parameter(test);
}

In C++ AMP, tiled_extent/index must use integer constants defined at compile time.

C++ AMP cannot use the C++11 standard for lambda function type definitions via “std::function<>” because C++ AMP requires the addition of the “restrict(amp)” which is not supported in std::function<>.

Campy.NET avoids many of these problems. It provides additional type checking to warn users of problems. Campy.NET allows one to define the type of the lambda expression and use that when passing functions as parameters. The Campy.NET types for tile sizes can be variable.

Campy.NET is an API, not a new language

Campy.NET is an API implemented in the C# language for parallel programming; it does not alter the compiler or runtime. Campy.NET is essentially a C# wrapper for C++ AMP.

Global memory management is seemless

Like C++ AMP, Campy.NET uses a container to wrap the access of a data structure in CPU memory. When accessed in GPU code, the data is copied to GPU memory automatically. After the Parallel_For_Each completes, the data is copied back to CPU memory only if the wrapper is accessed in CPU code.

Support for shared memory and thread synchronization

Campy.NET supports share memory and thread synchronization for efficient sub-division of large problems.

The following is an example from the NVIDIA Parallel ForAll column (http://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/), converted into Campy.NET.

using System;
using Campy;
using Campy.Types;

namespace Reverse
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a large array of numbers, some containing sixes.
            int size = 64;
            int[] data = new int[size];
            for (int i = 0; i < size; ++i) data[i] = i;
            Array_View<int> d = new Array_View<int>(ref data);

            // Initialize...
            Extent e = new Extent(size);
            AMP.Parallel_For_Each(d.Extent, (Index idx) =>
            {
                int j = idx[0];
                d[j] = j;
            });

            // Reverse...
            Tile_Static<int> s = new Tile_Static<int>(64);
            AMP.Parallel_For_Each(d.Extent.Tile(size), (Tiled_Index idx) =>
            {
                int t = idx.Local[0];
                int tr = size - t - 1;
                s[t] = d[t];
                idx.Barrier.Wait();
                d[t] = s[tr];
            });
            for (int i = 0; i < size; ++i)
                System.Console.WriteLine(d[i]);
        }
    }
}

Support for atomic operations

Campy.NET provides atomic addition, subtraction, fetch, and others, for thread synchronized access to memory. An example is the classic “sum of sixes” problem.

using System;
using Campy;
using Campy.Types;

namespace SumOfSixes
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a large array of numbers, some containing sixes.
            int size = 50000000;
            int[] data = new int[size];
            Array_View<int> d = new Array_View<int>(ref data);

            // Initialize...
            Extent e = new Extent(size);
            AMP.Parallel_For_Each(d.Extent, (Index idx) =>
            {
                int j = idx[0];
                d[j] = (j % 30) == 0 ? 6 : j;
            });

            // Count sixes...
            int[] result = new int[1];
            Array_View<int> r = new Array_View<int>(ref result);
            AMP.Parallel_For_Each(d.Extent, (Index idx) =>
            {
                int j = idx[0];
                if (d[j] == 6)
                    AMP.Atomic_Fetch_Add(ref r, 0, 1);
            });
            System.Console.WriteLine("result = " + r[0]);
        }
    }
}

Support for multi-core CPUs or GPUs

Campy.NET supports the execution of parallel code on any multi-core processor, as long as it is supported in C++ AMP. A programmer can easily select the processor to perform the parallel execution of the lambda functions.

Status

That said, the overall design and implementation of Campy.NET API are unset. While there is an effort to model it after C++ AMP, the API must be redesigned to avoid the same pitfalls of C++ AMP, such as not being able to handle pointers. See http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/19/restrict-amp-restrictions-part-0-of-n-introduction.aspx . Currently, it also relies heavily on the Visual Studio C++ compiler, which it should not. If you are interested in helping out, please let me (Ken Domino) know.

(Note: updated Feb 7, 2015 for API changes.)

Running and Debugging a Simple Example

Posted on January 18, 2015February 7, 2015 by kaby76

(Updated January 30, 2015.)

Prerequisites:

(1) You must have Visual Studio 2013 C++ installed.

(2) Download Campy.NET: Go to the Source tab and select download.

(3) Unpack the ZIP file and set up an environmental variable CAMPYNETROOT (either through the Control Panel, or specific for a command-line shell, like bash) to the path of the Campy.NET directory (using Windows backslash style, not forward slash Unix style).

(4) Download the source from ILSpy (ilspy.net). Unpack the ZIP file in the Campy.NET directory.

(5) Download the download and unpack the binaries for Uncrustify, and place it on your PATH. See http://uncrustify.sourceforge.net/

The Example:

Start Visual Studio 2013, and create a C# CLR Console Application. Open the C# source file, and paste the following code into the file. The example provided is the following code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Diagnostics;
using Campy;
using Campy.Types;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            int size = 100000;
            int[] data = new int[size];
            Extent e = new Extent(size);
            Array_View<int> d = new Array_View<int>(ref data);
            AMP.Parallel_For_Each(d.Extent, (Index idx) =>
            {
                int j = idx[0];
                d[j] = size - j - 1; // Capture size and d.
            });
            d.Synchronize();
            for (int i = 0; i < size; ++i)
            {
                System.Console.WriteLine(data[i]);
            }
        }
    }
}

Add Campy.Types.dll and Campy.NET.dll to the references. Compile and run the project. The output should be “99999 … 99998 … 99997 …”. NOTE: IF YOU MAKE CHANGES TO THE CODE AND REBUILD, THE PROGRAM MAY FAIL TO PERFORM A DYNAMIC LINK! YOU MUST STOP THE PROGRAM AND RERUN IT. THIS IS A PROBLEM WITH WINDOWS, AND NO WAY AT THE MOMENT TO BYPASS IT (YOU CANNOT UNLOAD A DLL THAT HAS ALREADY BEEN LOADED). CAMPY WILL RELINK THE DYNAMIC CODE THAT IT BUILDS AND PLACES IN THE DLL UPON PROGRAM START UP!

To debug the kernel, create a C++ CLR application. Open the Property Pages for the project, select Debugging in the left pane. In the right pane, set the Debugger Type to GPU Only. Select Command in the right pane, and enter the path of the C# executable, “…\ConsoleApplication1.exe”. Select Working Directory and set the path for the C# executable. In Visual Studio, open the generated file System_Void_Test_Program___c__DisplayClass1__Main_b__0_Campy_Types_Index__unmanaged.cpp. Set a breakpoint on a line in the kernel, then start a debug session. The program will stop in the kernel.

(Note: updated Feb 7, 2015 for API changes.)

Campy.NET – An API for GPGPU Computing for C#

Posted on January 18, 2015March 22, 2015 by kaby76

Welcome to Campy.NET! Campy.NET is a library for enabling general purpose GPU computing within the Microsoft Common Language Infrastructure, such as C#. Campy.NET is a API similar to Microsoft’s C++ AMP. The API converts your methods and delegates into kernels via a JIT compiler, which can then be run on a GPU. The source is free and open source.

Campy.NET is written in C#, C++ CLI, and C++ native code, and uses .NET Reflection, ILSpy/Mono.Cecil, Uncrustify, and Microsoft Visual C++, C++ AMP. Unlike CUDA, OpenCL, CUDA.NET, Cudafy, and other solutions for GPGPU programming, users write their programs naturally, without requiring the kernel being defined separately from the calling context, using an anonymous lambda delegate in a AMP.Parallel_For_Each call. When Parallel_For_Each is executed, Campy converts the anonymous lambda delegate into C++ AMP source code, compiles the code, loads the generated DLL, then executes the kernel, performing all the required parameter passing to C++ AMP. Campy retains the generated DLL for future calls to the Parallel_For_Each lambda, regenerating the DLL if it is out of date with respect to the executable program.

Campy.NET is available at http://campynet.codeplex.com/. It is in the preliminary stages of development, and has no documentation. If you would like to help out, please let us know by contacting the administrator for the project at http://campynet.codeplex.com/. To keep up to date, sign up for the newsletter.