February 2015

First release of Campy.NET

Posted on February 13, 2015February 14, 2015 by kaby76

The first release build of Campy.NET, v0.01.0000, has been uploaded to campynet.codeplex.com. To install, download the file, unzip, then execute setup.exe. Campy only works on Windows 8 or 10, and you will need to install Visual Studio Express for Windows Desktop. (Note, VS Professional or Ultimate will also work.) You won’t need to set the CAMPYNETROOT environmental variable mentioned previously, nor download ILSpy or Uncrustify, since these are now installed by the setup program. This release is built with debugging.

To run the code, create a small “Hello World” program, e.g.:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Campy;
using Campy.Types;

namespace ConsoleApplication10
{
    class Program
    {
        static void Main(string[] args)
        {
            int size = 10;
            int[] data = new int[size];
            for (int i = 0; i < size; ++i) data[i] = 2 * i;
            Extent e = new Extent(size);
            Array_View<int> d = new Array_View<int>(ref data);
            AMP.Parallel_For_Each(d.Extent, (Index idx) =>
            {
                int j = idx[0];
                d[j] = size - j;
            });
            d.Synchronize();
            for (int i = 0; i < size; ++i)
            {
                System.Console.WriteLine(data[i]);
            }
        }
    }
}

Then, in the Solution Explorer, add references to all the DLLs in the Campy.NET installation directory. Finally, build and run the program. Please note: Campy compiles on the fly as the program executes. If you previously built and ran your program, then make changes, Campy will need to rebuild the GPU code. Your program will stop, and prompt you to restart your program in order for Campy to finish rebuilding the GPU code.

Any questions or problems, let me know!

Next, I will be working on enhancing control flow analysis so that function calls of any form will be converted into GPU code. The goal is for programmers to have no need to annotate GPU code using attributes or modifiers. The compiler should be able to do that.

Ken

Array, Tiles, Shared Memory, and Generics

Posted on February 13, 2015February 14, 2015 by kaby76

New data types have been added to Campy.NET:

Array<>, modeled after C++ AMP array<>, is used to represent accelerator memory. Note: only one-dimensional arrays are implemented.
Tiled_Index, modeled after C++ AMP tiled_index, is used to represent thread identification, and synchronization.
Tile_Static, modeled after C++ AMP tile_static, is used to represent shared memory between threads in a tile.
Copy() is used to copy data between the accelerators and CPU.
C# Generics for GPU code has been implemented.

An example of the use of these types is available in here. Two types of reduction are implemented.

A milestone accomplished–computing the value of pi

Posted on February 6, 2015February 6, 2015 by kaby76

Campy.NET has passed a milestone: it can now approximate the value of pi. The code, below, is an implementation of the classic toy example “summing a circle’s area”.

using Campy;
using Campy.Types;
using System;

namespace Test
{
    class Program
    {
        class Point
        {
            public float x;
            public float y;
        }
        static void Main(string[] args)
        {
            int half_size = 1000;
            int size = half_size * half_size;

            Point[] data = new Point[size];
            for (int i = 0; i < size; ++i) data[i] = new Point();
            Array_View<Point> points = new Array_View<Point>(ref data);
            Extent e = new Extent(size);
            AMP.Parallel_For_Each(e, (Index idx) =>
            {
                int i = idx[0];
                points[i].x = (float)(1.0 * (i / half_size) / half_size);
                points[i].y = (float)(1.0 * (i % half_size) / half_size);
            });
            points.Synchronize();
            int[] insc = new int[size];
            Array_View<int> ins = new Array_View<int>(ref insc);
            //ins.discard_data();
            AMP.Parallel_For_Each(e, (Index idx) =>
            {
                int i = idx[0];
                float radius = 1.0f;
                float tx = points[i].x;
                float ty = points[i].y;
                float t = (float)Math.Sqrt(tx*tx + ty*ty);
                ins[i] = (t <= radius) ? 1 : 0;
            });
            Extent e_half = new Extent(half_size);
            int[] count = new int[1];
            count[0] = 0;
            Array_View<int> res = new Array_View<int>(ref count);
            AMP.Parallel_For_Each(e_half, (Index idx) =>
            {
                int i = idx[0];
                for (int j = 1; j < half_size; ++j)
                {
                    int k = i * half_size;
                    int t1 = ins[k + j];
                    int t2 = ins[k];
                    int t3 = t1 + t2;
                    ins[k] = t3;
                    // cannot decompile!!! ins[i * half_size] += ins[i * half_size + j];
                }
                AMP.Atomic_Fetch_Add(ref res, 0, ins[i * half_size]);
            });
            int cou = res[0];
            System.Console.WriteLine("Count is " + cou + " out of " + size);
            float pi = (4.0f * cou) / size;
            System.Console.WriteLine("Pi is " + pi);
        }
    }
}

Notes:

Although the example runs only slightly faster than the CPU equivalent implementation, there are several problems. First, unnecessary memory copy at line 21 occurs between CPU and GPU because the equivalent to discard_data has not yet been implemented. In addition, the last kernel uses an atomic operation to increment the count of the points within the circle (line 58). Converting the code to use the Blelloch method of reduction, and removing the unnecessary array “points” would improve the performance probably an order of magnitude or two. Alternatively, we could count the number of point outside the circle, then compute pi using the difference with the total number of points. In theory, this would result in less contention because there would be fewer points counted. But, this is the joy of GPU programming.

Also, in the last kernel, the ILSpy decompiler, which Campy.NET uses, has a notable problem with the code. The generated C++ AMP code cannot compile. The equivalent code is provided.

Salient Goals of Campy.NET

Posted on February 2, 2015February 7, 2015 by kaby76

Campy.NET is designed with a unique set of features for GPGPU programming. These features are intended for programmers wishing to easily express PRAM-like parallel algorithms for the GPU using C#.

Campy.NET supports interleaving of blocks of CPU and GPU code

Campy.NET is intended to support GPU program using lambda expressions. The primary advantage of this approach is that GPU code can be interleaved with CPU code, which results in clean and easily understandable code.

Let’s start by giving an example: initialize an array of integers with natural numbers. On a CPU, in C#, this is easily done with just a few lines of code, using a for-loop:

for (int j = 0; j < e; ++j)
{
    a[j] = j;
}

In Campy.NET, the code to perform the initialization on the GPU is similarly expressed:

AMP.Parallel_For_Each(e, (Index i) =>
{
    int j = i[0];
    a[j] = j;
});

This contrasts with CUDA, OpenCL, and wrappers for those languages. In all of these platforms, one is required to place the GPU code in the body of a specially annotated function called a kernel, and to call the kernel in a separate function. In fact, CPU and GPU code often are separated into different files.

Let’s consider what this simple array initialization would look like in CUDA. When reading function “kernel” alone, we do not know what range of elements of the array “a” are assigned; when reading function “host_code” alone, we do not know what “kernel” is performing for a given range. Only when read together does the user understand how host_code and kernel interact.

// GPU code
__global__ void kernel(int * a, int e)
{
    int j = threadIdx.x;
    a[j] = j;
}

...

// CPU code
void host_code()
{
    ...
    kernel<<<g,e>>>(a, e);
    ...
}

This problem is known as the Locality of Definition. Since CPU and GPU code blocks are separated, the readability and understandability of the algorithm suffers.

Campy.NET does not use additional syntax to mark GPU code

Campy.NET does not use language annotations to denote what is CPU code vs. GPU code. Campy.NET determines at runtime what code is GPU code when the user calls AMP.Parallel_For_Each.

In CUDA, OpenCL, OpenACC, C++ AMP, etc., the language (C++ or C) is enhanced with modifiers, pragmas, or attributes, in order to annotate GPU code. All code that is intended to run on the GPU must be annotated.

In the previous example, the function “kernel” is annotated with “__global__” to the compiler recognizes that the function is GPU code. In C++ AMP, GPU code is annotated with “restrict(amp)”.

In OpenACC, loops are annotated with pragmas to provide information to the compiler GPU code about which loop to parallelize, and which variables to capture in the body of the for-loop. For example, from the NVIDIA Parallel ForEach blog, the Jacobi iteration is shown:

while ( error > tol && iter < iter_max ) {
    error = 0.f;
 
    #pragma omp parallel for shared(m, n, Anew, A)
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                                 + A[j-1][i] + A[j+1][i]);
            error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
        }
    }
 
    #pragma omp parallel for shared(m, n, Anew, A)
    for( int j = 0; j < n-1; j++) {
        for( int i = 0; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];   
        }
    }
 
    if(iter % 100 == 0) printf("%d, %0.6fn", iter, error);
 
    iter++;
}

Unfortunately, custom pragmas are not available in the C# compiler, although the Roslyn compiler is now open source, and the compiler could be enhanced.

As mentioned above, in Campy.NET, GPU code is simply a lambda expression passed to the Campy.NET function AMP.Parallel_For_Each(). When the function is executed, Campy.NET performs a control flow analysis to determine what functions to convert to C++ AMP, then compiles, links, and calls the function. Lambda expressions capture variables defined prior to the call to AMP.Parallel_For_Each. In order to prevent the re-translation and compilation of the lambda expression, Campy.NET examines the timestamp of the program on the assemblies (DLL and EXE) for the program.

Campy.NET is modeled after C++ AMP, with enhancements

C++ AMP uses compile-time type checking to enforce the semantics of the API. Sometimes, the error messages yield good information of where to find the error in your program. For example, when using a CPU array inside a GPU code block.

    int * insc = new int[size];
	array_view<int, 1> ins(size, insc);
	ins.discard_data();
	parallel_for_each(e, [=](index<1> idx) restrict(amp)
	{
		int i = idx[0];
		float radius = 1.0;
		float tx = points[i].x;
		float ty = points[i].y;
		float t = sqrt(tx*tx + ty*ty);
		insc[i] = (t <= radius) ? 1 : 0;
	});

Other times, the C++ compiler gives cryptic error messages which do nothing to identify the cause of the problem, e.g., in this case, using a bool (C++ AMP does not support arrays of bools):

    bool * insc = new bool[size];
	array_view<bool, 1> ins(size, insc);
	ins.discard_data();
	parallel_for_each(e, [=](index<1> idx) restrict(amp)
	{
		int i = idx[0];
		float radius = 1.0;
		float tx = points[i].x;
		float ty = points[i].y;
		float t = sqrt(tx*tx + ty*ty);
		ins[i] = (t <= radius) ? true : false;
	});

In C++ AMP, a programmer can only pass a lambda to another either as an local variable (i.e., “auto”), or as a parameter whose type is a template parameter:

void capture_auto()
{
    ...

    auto IsEven = [=](int t) restrict(amp) -> bool
    {
        return t % 2 == 0;
    };

    parallel_for_each(extent<1>(size), [=](index<1> idx) restrict(amp)
    {
        int i = idx[0];
        a[i] = IsEven(i);
    });

    ...
}

template<typename _K>
void pass_as_parameter(const _K & IsOdd)
{
    ...

    parallel_for_each(extent<1>(size), [=](index<1> idx) restrict(amp)
    {
        int i = idx[0];
        a[i] = IsOdd(i);
    });

	...
}

void test2()
{
    auto test = [=](int t) restrict(amp) -> bool
    {
        return t % 2 = 1;
    };
    pass_as_parameter(test);
}

In C++ AMP, tiled_extent/index must use integer constants defined at compile time.

C++ AMP cannot use the C++11 standard for lambda function type definitions via “std::function<>” because C++ AMP requires the addition of the “restrict(amp)” which is not supported in std::function<>.

Campy.NET avoids many of these problems. It provides additional type checking to warn users of problems. Campy.NET allows one to define the type of the lambda expression and use that when passing functions as parameters. The Campy.NET types for tile sizes can be variable.

Campy.NET is an API, not a new language

Campy.NET is an API implemented in the C# language for parallel programming; it does not alter the compiler or runtime. Campy.NET is essentially a C# wrapper for C++ AMP.

Global memory management is seemless

Like C++ AMP, Campy.NET uses a container to wrap the access of a data structure in CPU memory. When accessed in GPU code, the data is copied to GPU memory automatically. After the Parallel_For_Each completes, the data is copied back to CPU memory only if the wrapper is accessed in CPU code.

Support for shared memory and thread synchronization

Campy.NET supports share memory and thread synchronization for efficient sub-division of large problems.

The following is an example from the NVIDIA Parallel ForAll column (http://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/), converted into Campy.NET.

using System;
using Campy;
using Campy.Types;

namespace Reverse
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a large array of numbers, some containing sixes.
            int size = 64;
            int[] data = new int[size];
            for (int i = 0; i < size; ++i) data[i] = i;
            Array_View<int> d = new Array_View<int>(ref data);

            // Initialize...
            Extent e = new Extent(size);
            AMP.Parallel_For_Each(d.Extent, (Index idx) =>
            {
                int j = idx[0];
                d[j] = j;
            });

            // Reverse...
            Tile_Static<int> s = new Tile_Static<int>(64);
            AMP.Parallel_For_Each(d.Extent.Tile(size), (Tiled_Index idx) =>
            {
                int t = idx.Local[0];
                int tr = size - t - 1;
                s[t] = d[t];
                idx.Barrier.Wait();
                d[t] = s[tr];
            });
            for (int i = 0; i < size; ++i)
                System.Console.WriteLine(d[i]);
        }
    }
}

Support for atomic operations

Campy.NET provides atomic addition, subtraction, fetch, and others, for thread synchronized access to memory. An example is the classic “sum of sixes” problem.

using System;
using Campy;
using Campy.Types;

namespace SumOfSixes
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a large array of numbers, some containing sixes.
            int size = 50000000;
            int[] data = new int[size];
            Array_View<int> d = new Array_View<int>(ref data);

            // Initialize...
            Extent e = new Extent(size);
            AMP.Parallel_For_Each(d.Extent, (Index idx) =>
            {
                int j = idx[0];
                d[j] = (j % 30) == 0 ? 6 : j;
            });

            // Count sixes...
            int[] result = new int[1];
            Array_View<int> r = new Array_View<int>(ref result);
            AMP.Parallel_For_Each(d.Extent, (Index idx) =>
            {
                int j = idx[0];
                if (d[j] == 6)
                    AMP.Atomic_Fetch_Add(ref r, 0, 1);
            });
            System.Console.WriteLine("result = " + r[0]);
        }
    }
}

Support for multi-core CPUs or GPUs

Campy.NET supports the execution of parallel code on any multi-core processor, as long as it is supported in C++ AMP. A programmer can easily select the processor to perform the parallel execution of the lambda functions.

Status

That said, the overall design and implementation of Campy.NET API are unset. While there is an effort to model it after C++ AMP, the API must be redesigned to avoid the same pitfalls of C++ AMP, such as not being able to handle pointers. See http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/19/restrict-amp-restrictions-part-0-of-n-introduction.aspx . Currently, it also relies heavily on the Visual Studio C++ compiler, which it should not. If you are interested in helping out, please let me (Ken Domino) know.

(Note: updated Feb 7, 2015 for API changes.)