CLI-based APIs for GPGPU Computing

How does Campy compare to other CLI-based (e.g., C#, F#) APIs for GPGPU computing? The following is a partial list of what is currently available, along with an example for each that computes the even natural numbers (doubling each element of the sequence 0, 1, ..., N-1).

Cudafy

Cudafy is an API targeting CUDA and OpenCL. Programs and kernels are written in C# or other CIL languages. Cudafy scans the CIL, converts GPU code (labeled with the “[Cudafy]” attribute) into CUDA C++ or OpenCL, compiles the code into PTX, then runs it using CUDA.NET. Cudafy also uses Cloo (see below for more information on Cloo).

Note: This API is very good and fairly well maintained. It is easy to use and well documented.

Author: Nicholas Kopp; Download: http://cudafy.codeplex.com/  http://www.hybriddsp.com/Products/CUDAfyNET.aspx; Last updated Jan 2015; Started ~Jun 2011; Open source, GNU Library General Public License (LGPL).

Example:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

namespace CudafyExample
{
    class Program
    {
        static void Main(string[] args)
        {
            CudafyModes.Target = eGPUType.Cuda;
            CudafyModes.DeviceId = 0;
            CudafyTranslator.Language = CudafyModes.Target == eGPUType.OpenCL ? eLanguage.OpenCL : eLanguage.Cuda;
            GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, 0);
            eArchitecture arch = gpu.GetArchitecture();
            CudafyModule km = CudafyTranslator.Cudafy(arch);
            gpu.LoadModule(km);
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            int[] dev_a = gpu.Allocate<int>(a);
            for (int i = 0; i < N; i++)
                a[i] = i;
            gpu.CopyToDevice(a, dev_a);
            gpu.Launch(grid_size, block_size, fun, dev_a, N);
            gpu.CopyFromDevice(dev_a, a);
            gpu.Free(dev_a);
        }

        [Cudafy]
        public static void fun(GThread thread, int[] a, int N)
        {
            int tid = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x; // global index; blockIdx.x alone would cover only the first grid_size elements
            if (tid < N)
            {
                a[tid] = a[tid] * 2;
            }
        }
    }
}

ManagedCUDA

ManagedCUDA is an API for the CUDA Driver API. The user writes kernels in CUDA C++, then compiles them into PTX via the CUDA compiler. ManagedCUDA executes the PTX kernels. Copying data to/from the GPU is achieved via an assignment, which is a nice simplification in syntax.

Note: This API is also well maintained and easy to use. The main issue is that kernels must be written in CUDA C++ and compiled into PTX. If you’re looking for a way to write kernels in C#, this solution isn’t for you. But it is an excellent, straightforward wrapper for the CUDA Driver API.

Author: kunzmi (Michael Kunz); Download: http://managedcuda.codeplex.com/; Last updated Feb 2015; Started ~ Jan 2011; Open source, GNU Library General Public License (LGPL).

Example:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using ManagedCuda;
using ManagedCuda.BasicTypes;
using ManagedCuda.VectorTypes;
using System.IO;
using System.Reflection;

namespace ManagedCudaExample
{
    class Program
    {
        static void Main(string[] args)
        {
            CudaContext ctx = new CudaContext(CudaContext.GetMaxGflopsDeviceId());
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            CudaDeviceVariable<int> dev_a = new CudaDeviceVariable<int>(N);
            for (int i = 0; i < N; i++)
                a[i] = i;
            dev_a = a;
            string resName = "kernel.ptx";
            string resNamespace = "ManagedCudaExample";
            string resource = resNamespace + "." + resName;
            Stream stream = Assembly.GetExecutingAssembly().GetManifestResourceStream(resource);
            if (stream == null) throw new ArgumentException("Kernel not found in resources.");
            CudaKernel kernel = ctx.LoadKernelPTX(stream, "kernel");
            kernel.BlockDimensions = block_size;
            kernel.GridDimensions = grid_size;
            kernel.Run(dev_a.DevicePointer, N);
            a = dev_a;
        }
    }
}

ManagedCUDA requires kernels built as PTX. The following is the kernel in CUDA C++, which can be compiled into PTX via the CUDA compiler (nvcc) with the -ptx option, or extracted from the intermediate files retained by the -keep option.

//Includes for IntelliSense 
#include <cuda.h>
 
extern "C"  {
    //kernel code
	__global__ void kernel(int * a, int N)
	{
		int i = blockDim.x * blockIdx.x + threadIdx.x;
		if (i < N)
			a[i] = a[i] * 2;
	}
}
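
The PTX for the kernel above can be generated directly on the command line; a minimal sketch, assuming nvcc is on the PATH and the kernel source is saved as kernel.cu:

```shell
# Emit kernel.ptx directly, rather than fishing it out of -keep's intermediate files.
nvcc -ptx kernel.cu -o kernel.ptx
```

The resulting kernel.ptx is what the ManagedCUDA example embeds as a resource and the CUDA.NET example loads from disk.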

CUDA.NET

CUDA.NET is an API for the CUDA Driver API. Users write kernels in CUDA C++, compile them into PTX, then execute the PTX with CUDA.NET.

Download: http://www.cass-hpc.com/solutions/libraries/cuda-net; Last updated 2010; Started ~Apr 2008.

Notes: Supposedly, the code for CUDA.NET is open source and has no license requirements. However, the archive page does not have a download link for the source, just the binaries. I found the source in the Cudafy source tree. The latest build is for CUDA 3.0, which is very old (currently, NVIDIA is on CUDA 7.0). As I mentioned, the only source I could find is under Cudafy (Cudafy/3p/cuda.net3.0.0_win/src/…), which has been updated for Cudafy and CUDA 6.5. The documentation in the latest download, v3.0, is unreadable. The only other documentation I could find is for v2.0 (http://www.hoopoe-cloud.com/files/cuda.net/2.0/CUDA.NET_2.0.pdf), which contains numerous inaccuracies.

Example:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using GASS.CUDA;
using GASS.CUDA.Types;
using GASS.Types;

namespace CudaNetExample
{
    class Program
    {
        static void Main(string[] args)
        {
            CUResult i0 = CUDADriver.cuInit(0);
            CUdevice dev = new CUdevice();
            CUResult i1 = CUDADriver.cuDeviceGet(ref dev, 0);
            CUcontext ctx = new CUcontext();
            CUResult i2 = CUDADriver.cuCtxCreate(ref ctx, 0, dev);

            CUdeviceptr p1 = new CUdeviceptr();
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            for (int i = 0; i < N; i++)
                a[i] = i;
            CUResult s1 = CUDADriver.cuMemAlloc(ref p1, N * sizeof(int));
            CUResult s2 = CUDADriver.cuMemcpyHtoD(p1, a, a.Length * sizeof(int));
            CUmodule mod = new CUmodule();
            CUResult s3 = CUDADriver.cuModuleLoad(ref mod, @"C:\Users\Ken\Documents\Visual Studio 2013\Projects\ManagedCudaExample\Win32Project1\Debug\kernel.ptx");
            CUfunction func = new CUfunction();
            CUResult s4 = CUDADriver.cuModuleGetFunction(ref func, mod, "kernel");

            CUResult r1 = CUDADriver.cuParamSeti(func, 0, (uint)p1.Pointer);
            CUResult r2 = CUDADriver.cuParamSeti(func, 4, (uint)N);
            CUResult r3 = CUDADriver.cuParamSetSize(func, 8);
            CUResult r4 = CUDADriver.cuFuncSetBlockShape(func, block_size, 1, 1);
            CUResult r5 = CUDADriver.cuLaunchGrid(func, grid_size, 1);
            CUResult r6 = CUDADriver.cuCtxSynchronize();

            CUResult f1 = CUDADriver.cuMemcpyDtoH(a, p1, a.Length * sizeof(int));
 
        }
    }
}

Alea GPU

Alea GPU is an API similar to Cudafy. Kernels are annotated with the “[AOTCompile]” attribute and are compiled for the GPU via LLVM.

Author: Daniel Egloff; Download: http://quantalea.com/home and http://quantalea.com/static/app/tutorial/quick_start/quick_start_example.html; Last updated Mar 2015. Proprietary.

Example:

using System;
using Alea.CUDA;
using Alea.CUDA.Utilities;
using Alea.CUDA.IL;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace AleaExample
{
    class Program
    {
        [AOTCompile]
        static void kernel(deviceptr<int> a, int n)
        {
            var i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)
                a[i] = a[i] * 2;
        }

        static void Main(string[] args)
        {
            var inputs = Enumerable.Range(0, 1000000).Select(i => i).ToArray();
            var worker = Worker.Default;
            DeviceMemory<int> dInputs = worker.Malloc(inputs);
            const int blockSize = 256;
            var numSm = worker.Device.Attributes.MULTIPROCESSOR_COUNT;
            var gridSize = Math.Min(16 * numSm, Common.divup(inputs.Length, blockSize));
            var lp = new LaunchParam(gridSize, blockSize);
            worker.Launch<deviceptr<int>, int>(kernel, lp, dInputs.Ptr, inputs.Length);
        }
    }
}

Note: I could not get the example to compile and run. The Alea GPU compiler requires a license, which I could not get because the email confirmation link produced an error on the company’s server.

CudaSharp

In CudaSharp, users write kernels in C#. The GPU code is converted into PTX via LLVM, then executed via the ManagedCUDA API. Kernels can be lambdas.

Author: khyperia (Evan Huack); Download: https://github.com/khyperia/CudaSharp; Last updated Jan 2014; Started Jan 2014; Open source, no license.

Note: This project is incomplete. The source code for the API does not build. Further, while in principle the code translates MS CIL, which is a stack-based assembly language, into LLVM, an SSA-based representation, it may not be correct because there is no code that merges the evaluation stacks arriving from multiple predecessor blocks at a join point.
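
To illustrate the concern: when two CIL basic blocks each leave a value on the evaluation stack and branch to a common successor, an SSA-based translation must merge those stack slots, typically with a phi node at the join. A sketch (illustrative pseudocode, not actual CudaSharp output):

```
// CIL: both arms push one value before the join
      brtrue THEN
      ldc.i4 10        // stack: [10]
      br JOIN
THEN: ldc.i4 20        // stack: [20]
JOIN: stloc.0          // pops whichever value arrived

; LLVM SSA at the join: the shared stack slot becomes a phi
JOIN: %v = phi i32 [ 10, %else ], [ 20, %then ]
      store i32 %v, i32* %loc0
```

Without this merge step, values reaching a join from different predecessors are silently conflated.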

A similar project was written for a Master’s thesis by Soren Horup, Soren Juul, and Herik Larsen at Aalborg University in June 2011. The thesis is available at http://projekter.aau.dk/projekter/files/52819053/Report.pdf. However, I could not find the source code for the project.

GPU.NET

UK-based company TidePowerd.com, started by Nicolas Beecroft and Jack Pappas, produced a CIL-based backend targeting CUDA GPUs. The website is no longer available, so the details aren’t clear. However, there are examples in https://github.com/tidepowerd/GPU.NET-Example-Projects, which indicate kernel markup with “[kernel]” attributes and a high-level model of the GPU based upon CUDA C++.

GpuLinq

Written in F#, this LINQ-like API contains a compiler that runs queries on the GPU (https://github.com/nessos/GpuLinq/). The current status of the project is unknown.

Brahma

A project of the Software Engineering Research Group at Saint Petersburg State University in Russia; the code was reportedly an API that targeted OpenCL. However, it is no longer available. There is a similar project, Brahma.FSharp, which seems to be active (last updated 2014), but the details are unclear. https://sites.google.com/site/semathsrprojects/home

Microsoft Accelerator

This project offered a LINQ-like API containing a compiler that runs queries on the GPU. The status of the project is unclear. http://research.microsoft.com/en-us/projects/Accelerator/

OpenCL.NET

OpenCL is an API for multicore computing, whether on a CPU or a GPU; it is platform- and device-neutral. OpenCL.NET is a thin wrapper on top of the OpenCL.dll assembly. OpenCL requires one to write kernels in OpenCL C; these are compiled, linked, loaded, and run by the OpenCL drivers. Note: OpenCL.NET has not been updated since Sep 2013. https://openclnet.codeplex.com/

Example:

using System;
using System.Diagnostics;
using System.Linq;
using OpenCL.Net.Extensions;
using OpenCL.Net;
using System.IO;
using System.Collections.Generic;

namespace Opencl_Device_Query
{
    class Program
    {
        static void Main(string[] args)
        {
            List<String> source_list = new List<string>();            
            try
            {
                using (StreamReader sr = new StreamReader("kernel.cl"))
                {
                    String line = sr.ReadToEnd();
                    source_list.Add(line);
                }
            }
            catch (Exception e)
            {
                Console.WriteLine("The file could not be read:");
                Console.WriteLine(e.Message);
                System.Environment.Exit(1);
            }
            String[] source = source_list.ToArray();
            IntPtr[] lengths = new IntPtr[source.Length];
            for (int i = 0; i < source.Length; ++i) lengths[i] = (IntPtr)source[i].Length;
            ErrorCode e1;
            Context context = Cl.CreateContext("NVIDIA CUDA", DeviceType.Gpu, out e1);
            OpenCL.Net.Program program = Cl.CreateProgramWithSource(context, 1, source, lengths, out e1);
            InfoBuffer ib = Cl.GetContextInfo(context, ContextInfo.Devices, out e1);
            Device[] devices = ib.CastToArray<Device>(ib.Size);
            e1 = Cl.BuildProgram(program, 1, devices, "", null, (IntPtr)0);
            if (e1 != ErrorCode.Success)
            {
                ib = Cl.GetProgramBuildInfo(program, devices[0], ProgramBuildInfo.Log, out e1);
                char[] log = ib.CastToArray<char>(ib.Size);
                System.Console.WriteLine(new String(log));
                System.Environment.Exit(1);
            }
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            for (int i = 0; i < N; i++)
                a[i] = i;
            IMem<int> dev_a = Cl.CreateBuffer<int>(context, MemFlags.ReadWrite, N, out e1);
            Kernel kernel = Cl.CreateKernel(program, "mykernel", out e1);
            e1 = Cl.SetKernelArg(kernel, 0, dev_a);
            e1 = Cl.SetKernelArg(kernel, 1, N);
            CommandQueue command_queue = Cl.CreateCommandQueue(context, devices[0], CommandQueueProperties.None, out e1);
            Event evt;
            e1 = Cl.EnqueueWriteBuffer(command_queue, dev_a, Bool.True, a, 0, null, out evt);
            IntPtr[] tiles;
            IntPtr[] tile_size;
            l2t(grid_size * block_size, 1, out tile_size, out tiles);
            e1 = Cl.EnqueueNDRangeKernel(command_queue, kernel, 1, null, (IntPtr[])tiles, (IntPtr[])tile_size, 0, null, out evt);
            e1 = Cl.EnqueueReadBuffer(command_queue, dev_a, Bool.True, a, 0, null, out evt);
        }

        static int the_blocksize = 256;

        static void l2t(int size, int max_dimensionality, out IntPtr[] tile_size, out IntPtr[] tiles)
        {
            tile_size = new IntPtr[3];
            tiles = new IntPtr[3];            
            for (int j = 0; j < max_dimensionality; ++j)
                tiles[j] = (IntPtr)1;
            int[] max_threads = new int[3]{ the_blocksize, 64, 64};
            int[] max_blocks = new int[3] { 65535, 65535, 65535 };
            for (int j = 0; j < max_dimensionality; ++j)
                tile_size[j] = (IntPtr)1;

            int b = size / (max_threads[0] * max_blocks[0]);
            if (b == 0)
            {
                b = size / max_threads[0];
                if (size % max_threads[0] != 0)
                    b++;

                if (b == 1)
                    max_threads[0] = size;

                // done. return the result.
                tiles[0] = (IntPtr)b;
                tile_size[0] = (IntPtr)max_threads[0];

                // OpenCL uses multiples of tile_size.
                tiles[0] = (IntPtr)((int)tile_size[0] * (int)tiles[0]);
                return;
            }

            int sqrt_size = (int)Math.Sqrt((float)size / max_threads[0]);
            sqrt_size++;

            int b2 = sqrt_size / max_blocks[1];
            if (b2 == 0)
            {
                b = sqrt_size;

                // done. return the result.
                tiles[0] = tiles[1] = (IntPtr)b;
                tile_size[0] = (IntPtr)max_threads[0];

                // OpenCL uses multiples of tile_size.
                tiles[0] = (IntPtr)((int)tile_size[0] * (int)tiles[0]);
                //        tiles[1] *= tile_size[1];
                return;
            }
        }
    }
}


//kernel code
__kernel void mykernel(__global int * a, int N)
{
    int i = get_global_id(0);
	if (i < N)
		a[i] = a[i] * 2;
}

Cloo

An OpenCL API. Note: Although I was able to get the example to work with Cloo, it wasn’t easy. There are no definitive examples or documentation on the project website; familiarity with the standard OpenCL API helped me get something working. I found the API confusing for a couple of reasons: sometimes one uses the “new” operator to create an object (e.g., ComputeBuffer), and other times a factory pattern (e.g., program.CreateKernel); sometimes one accesses a static member (e.g., ComputePlatform.Platforms), and other times calls a method of an instance (e.g., program.GetBuildLog). Compare the examples of Cloo and OpenCL.NET to explore the differences. The API was last updated in 2012. http://cloo.sourceforge.net/ http://sourceforge.net/projects/cloo/

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Cloo;
using Cloo.Bindings;
using System.IO;
using System.Runtime.InteropServices;

namespace ClooSample
{
    class Program
    {
        static void Main(string[] args)
        {
            List<String> source_list = new List<string>();
            try
            {
                using (StreamReader sr = new StreamReader("kernel.cl"))
                {
                    String line = sr.ReadToEnd();
                    source_list.Add(line);
                }
            }
            catch (Exception e)
            {
                Console.WriteLine("The file could not be read:");
                Console.WriteLine(e.Message);
                System.Environment.Exit(1);
            }
            String[] source = source_list.ToArray();
            ComputePlatform cp = ComputePlatform.Platforms[1];
            ComputeContextPropertyList properties = new ComputeContextPropertyList(cp);
            ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu, properties, null, IntPtr.Zero);
            IReadOnlyCollection<ComputeDevice> devices = context.Devices;
            ComputeDevice[] devs = new ComputeDevice[1];
            devs[0] = devices.First();
            Cloo.ComputeProgram program = new ComputeProgram(context, source);
            try
            {
                program.Build(devs, "", (ComputeProgramBuildNotifier)null, IntPtr.Zero);
            }
            catch
            {
                String error = program.GetBuildLog(devs[0]);
                System.Console.WriteLine(error);
                System.Environment.Exit(1);
            }
            int grid_size = 5000;
            int block_size = 512;
            int N = grid_size * block_size;
            int[] a = new int[N];
            for (int i = 0; i < N; i++)
                a[i] = i;
            Cloo.ComputeBuffer<int> deva = new ComputeBuffer<int>(context, ComputeMemoryFlags.ReadWrite, a);
            Cloo.ComputeKernel kernel = program.CreateKernel("mykernel");
            kernel.SetMemoryArgument(0, deva);
            kernel.SetValueArgument(1, N);
            Cloo.ComputeCommandQueue command_queue = new ComputeCommandQueue(context, devs[0], ComputeCommandQueueFlags.None);
            command_queue.WriteToBuffer(a, deva, true, null);
            long[] tiles;
            long[] tile_size;
            l2t(grid_size * block_size, 1, out tile_size, out tiles);
            command_queue.Execute(kernel, null, tiles, tile_size, null);
            command_queue.ReadFromBuffer(deva, ref a, true, null);
        }

        static int the_blocksize = 256;

        static void l2t(int size, int max_dimensionality, out long[] tile_size, out long[] tiles)
        {
            tile_size = new long[max_dimensionality];
            tiles = new long[max_dimensionality];
            for (int j = 0; j < max_dimensionality; ++j)
                tiles[j] = (long)1;
            int[] max_threads = new int[3] { the_blocksize, 64, 64 };
            int[] max_blocks = new int[3] { 65535, 65535, 65535 };
            for (int j = 0; j < max_dimensionality; ++j)
                tile_size[j] = (long)1;

            int b = size / (max_threads[0] * max_blocks[0]);
            if (b == 0)
            {
                b = size / max_threads[0];
                if (size % max_threads[0] != 0)
                    b++;

                if (b == 1)
                    max_threads[0] = size;

                // done. return the result.
                tiles[0] = (long)b;
                tile_size[0] = (long)max_threads[0];

                // OpenCL uses multiples of tile_size.
                tiles[0] = (long)((int)tile_size[0] * (int)tiles[0]);
                return;
            }

            int sqrt_size = (int)Math.Sqrt((float)size / max_threads[0]);
            sqrt_size++;

            int b2 = sqrt_size / max_blocks[1];
            if (b2 == 0)
            {
                b = sqrt_size;

                // done. return the result.
                tiles[0] = tiles[1] = (long)b;
                tile_size[0] = (long)max_threads[0];

                // OpenCL uses multiples of tile_size.
                tiles[0] = (long)((int)tile_size[0] * (int)tiles[0]);
                //        tiles[1] *= tile_size[1];
                return;
            }
        }
    }
}


Note: A file containing all these examples as MS Visual Studio 2013 projects is available at https://domemtech.box.com/s/a472ynr3duu1vlle2i0dnwjpzclfd0k6.

Updated May 3, 2015