Salient Goals of Campy.NET

Campy.NET is designed with a unique set of features for GPGPU programming. These features are intended for programmers wishing to easily express PRAM-like parallel algorithms for the GPU using C#.

Campy.NET supports interleaving of blocks of CPU and GPU code

Campy.NET is intended to support GPU programming using lambda expressions. The primary advantage of this approach is that GPU code can be interleaved with CPU code, which results in clean and easily understandable code.

Let’s start by giving an example: initialize an array of integers with natural numbers.  On a CPU, in C#, this is easily done with just a few lines of code, using a for-loop:

for (int j = 0; j < e; ++j)
{
    a[j] = j;
}

In Campy.NET, the code to perform the initialization on the GPU is similarly expressed:

AMP.Parallel_For_Each(e, (Index i) =>
{
    int j = i[0];
    a[j] = j;
});

This contrasts with CUDA, OpenCL, and wrappers for those languages. In all of these platforms, one is required to place the GPU code in the body of a specially annotated function called a kernel, and to call the kernel in a separate function. In fact, CPU and GPU code often are separated into different files.

Let’s consider what this simple array initialization would look like in CUDA. Reading the function “kernel” alone, we do not know what range of elements of the array “a” is assigned; reading the function “host_code” alone, we do not know what “kernel” computes over a given range. Only when the two are read together does the reader understand how host_code and kernel interact.

// GPU code
__global__ void kernel(int * a, int e)
{
    int j = threadIdx.x;
    a[j] = j;
}

...

// CPU code
void host_code()
{
    ...
    kernel<<<g,e>>>(a, e);
    ...
}

This is the problem of locality of definition: because the CPU and GPU code blocks are separated, the readability and understandability of the algorithm suffer.

Campy.NET does not use additional syntax to mark GPU code

Campy.NET does not use language annotations to denote what is CPU code vs. GPU code. Campy.NET determines at runtime what code is GPU code when the user calls AMP.Parallel_For_Each.

In CUDA, OpenCL, OpenACC, C++ AMP, etc., the language (C++ or C) is enhanced with modifiers, pragmas, or attributes, in order to annotate GPU code. All code that is intended to run on the GPU must be annotated.

In the previous example, the function “kernel” is annotated with “__global__” so that the compiler recognizes that the function is GPU code. In C++ AMP, GPU code is annotated with “restrict(amp)”.

In OpenACC and OpenMP, loops are annotated with pragmas that tell the compiler which loops to parallelize and which variables to share in the loop bodies. For example, from the NVIDIA Parallel Forall blog, the OpenMP version of a Jacobi iteration is shown:

while ( error > tol && iter < iter_max ) {
    error = 0.f;
 
    #pragma omp parallel for shared(m, n, Anew, A)
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                                 + A[j-1][i] + A[j+1][i]);
            error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
        }
    }
 
    #pragma omp parallel for shared(m, n, Anew, A)
    for( int j = 0; j < n-1; j++) {
        for( int i = 0; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];   
        }
    }
 
    if(iter % 100 == 0) printf("%d, %0.6f\n", iter, error);
 
    iter++;
}

Unfortunately, custom pragmas are not available in the C# compiler, although the Roslyn compiler is now open source and could be enhanced to support them.

As mentioned above, in Campy.NET, GPU code is simply a lambda expression passed to the Campy.NET function AMP.Parallel_For_Each(). When the function is executed, Campy.NET performs a control flow analysis to determine which functions to convert to C++ AMP, then compiles, links, and calls the generated code. Lambda expressions capture variables defined prior to the call to AMP.Parallel_For_Each. To avoid re-translating and recompiling a lambda expression on every run, Campy.NET examines the timestamps of the program's assemblies (DLLs and EXE) and reuses the previously generated code when they are unchanged.
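
For example, the following fragment captures the local variables “a” and “scale” in the lambda passed to AMP.Parallel_For_Each; no annotation or separate kernel function is needed (the variable names are illustrative, using only the API shown elsewhere in this post):

int size = 1000;
int scale = 10;
int[] data = new int[size];
Array_View<int> a = new Array_View<int>(ref data);
AMP.Parallel_For_Each(a.Extent, (Index idx) =>
{
    int j = idx[0];
    a[j] = j * scale;   // "a" and "scale" are captured from the enclosing scope
});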

Campy.NET is modeled after C++ AMP, with enhancements

C++ AMP uses compile-time type checking to enforce the semantics of the API. Sometimes the error messages give good information about where to find the error in your program, for example, when a raw CPU array is used inside a GPU code block:

    int * insc = new int[size];
    array_view<int, 1> ins(size, insc);
    ins.discard_data();
    parallel_for_each(e, [=](index<1> idx) restrict(amp)
    {
        int i = idx[0];
        float radius = 1.0f;
        float tx = points[i].x;
        float ty = points[i].y;
        float t = sqrt(tx*tx + ty*ty);
        // Error: the raw CPU pointer "insc" is used here instead of the array_view "ins".
        insc[i] = (t <= radius) ? 1 : 0;
    });

Other times, the C++ compiler gives cryptic error messages that do nothing to identify the cause of the problem, e.g., in this case, the use of a bool array (C++ AMP does not support arrays of bools):

    bool * insc = new bool[size];
    // Error: bool is not a supported array_view element type in C++ AMP.
    array_view<bool, 1> ins(size, insc);
    ins.discard_data();
    parallel_for_each(e, [=](index<1> idx) restrict(amp)
    {
        int i = idx[0];
        float radius = 1.0f;
        float tx = points[i].x;
        float ty = points[i].y;
        float t = sqrt(tx*tx + ty*ty);
        ins[i] = (t <= radius) ? true : false;
    });

In C++ AMP, a programmer can only pass a lambda to another function either as a local variable (i.e., declared “auto”) or as a parameter whose type is a template parameter:

void capture_auto()
{
    ...

    auto IsEven = [=](int t) restrict(amp) -> bool
    {
        return t % 2 == 0;
    };

    parallel_for_each(extent<1>(size), [=](index<1> idx) restrict(amp)
    {
        int i = idx[0];
        a[i] = IsEven(i);
    });

    ...
}

template<typename _K>
void pass_as_parameter(const _K & IsOdd)
{
    ...

    parallel_for_each(extent<1>(size), [=](index<1> idx) restrict(amp)
    {
        int i = idx[0];
        a[i] = IsOdd(i);
    });

    ...
}

void test2()
{
    auto test = [=](int t) restrict(amp) -> bool
    {
        return t % 2 == 1;
    };
    pass_as_parameter(test);
}

In C++ AMP, the tile sizes of tiled_extent and tiled_index are template arguments, and so must be integer constants known at compile time.

C++ AMP also cannot use the C++11 std::function<> to name lambda types, because the required “restrict(amp)” specifier cannot be expressed in a std::function<> signature.

Campy.NET avoids many of these problems. It performs additional type checking to warn users of errors; it allows the type of a lambda expression to be declared and used when passing functions as parameters; and its tile sizes can be run-time variables.
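
For example, in C# the type of a lambda is an ordinary delegate type, so it can be named and passed as a parameter without templates. The following is a minimal sketch mirroring the C++ AMP example above; whether Campy.NET can translate a delegate-typed parameter such as Func<int, bool> in this way is an assumption here, and the names are illustrative:

static void pass_as_parameter(Array_View<int> a, Func<int, bool> is_odd)
{
    // "is_odd" has an explicit, nameable type; no "auto" or template parameter is needed.
    AMP.Parallel_For_Each(a.Extent, (Index idx) =>
    {
        int j = idx[0];
        a[j] = is_odd(j) ? 1 : 0;
    });
}

static void test2()
{
    int[] data = new int[64];
    Array_View<int> a = new Array_View<int>(ref data);
    Func<int, bool> is_odd = (int t) => t % 2 == 1;
    pass_as_parameter(a, is_odd);
}

Likewise, the reverse example below passes a run-time value to Extent.Tile rather than a compile-time constant.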

Campy.NET is an API, not a new language

Campy.NET is an API implemented in the C# language for parallel programming; it does not alter the compiler or runtime. Campy.NET is essentially a C# wrapper for C++ AMP.

Global memory management is seamless

Like C++ AMP, Campy.NET uses a container to wrap the access of a data structure in CPU memory. When accessed in GPU code, the data is copied to GPU memory automatically. After the Parallel_For_Each completes, the data is copied back to CPU memory only if the wrapper is accessed in CPU code.
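
A minimal sketch of this behavior, with the copy points described above noted in comments:

int[] data = new int[1000];
Array_View<int> d = new Array_View<int>(ref data);  // wraps CPU memory; no copy yet

AMP.Parallel_For_Each(d.Extent, (Index idx) =>
{
    int j = idx[0];
    d[j] = 2 * j;   // accessed in GPU code: the data is copied to GPU memory
});

int x = d[5];       // accessed in CPU code afterward: the data is copied back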

Support for shared memory and thread synchronization

Campy.NET supports shared memory and thread synchronization for efficient subdivision of large problems.

The following is an example from the NVIDIA Parallel Forall blog (http://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/), converted into Campy.NET.

using System;
using Campy;
using Campy.Types;

namespace Reverse
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create an array of sequential numbers.
            int size = 64;
            int[] data = new int[size];
            for (int i = 0; i < size; ++i) data[i] = i;
            Array_View<int> d = new Array_View<int>(ref data);

            // Initialize...
            AMP.Parallel_For_Each(d.Extent, (Index idx) =>
            {
                int j = idx[0];
                d[j] = j;
            });

            // Reverse...
            Tile_Static<int> s = new Tile_Static<int>(64);
            AMP.Parallel_For_Each(d.Extent.Tile(size), (Tiled_Index idx) =>
            {
                int t = idx.Local[0];
                int tr = size - t - 1;
                s[t] = d[t];
                idx.Barrier.Wait();
                d[t] = s[tr];
            });
            for (int i = 0; i < size; ++i)
                System.Console.WriteLine(d[i]);
        }
    }
}

Support for atomic operations

Campy.NET provides atomic addition, subtraction, fetch, and other operations for thread-synchronized access to memory. An example is the classic “sum of sixes” problem.

using System;
using Campy;
using Campy.Types;

namespace SumOfSixes
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a large array of numbers, some containing sixes.
            int size = 50000000;
            int[] data = new int[size];
            Array_View<int> d = new Array_View<int>(ref data);

            // Initialize...
            AMP.Parallel_For_Each(d.Extent, (Index idx) =>
            {
                int j = idx[0];
                d[j] = (j % 30) == 0 ? 6 : j;
            });

            // Count sixes...
            int[] result = new int[1];
            Array_View<int> r = new Array_View<int>(ref result);
            AMP.Parallel_For_Each(d.Extent, (Index idx) =>
            {
                int j = idx[0];
                if (d[j] == 6)
                    AMP.Atomic_Fetch_Add(ref r, 0, 1);
            });
            System.Console.WriteLine("result = " + r[0]);
        }
    }
}

Support for multi-core CPUs or GPUs

Campy.NET supports the execution of parallel code on any multi-core processor that is supported by C++ AMP. A programmer can easily select the processor on which the lambda functions execute in parallel.
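
For illustration only, device selection might look something like the following. The Accelerator type and Default_Accelerator property are hypothetical names, not confirmed Campy.NET API; they mirror the accelerator class that C++ AMP provides underneath:

// Hypothetical sketch: enumerate the accelerators C++ AMP reports and
// pick one as the default for subsequent Parallel_For_Each calls.
foreach (Accelerator acc in Accelerator.All_Accelerators())
    System.Console.WriteLine(acc.Description);
AMP.Default_Accelerator = Accelerator.All_Accelerators()[0];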

Status

That said, the overall design and implementation of the Campy.NET API are not yet settled. While there is an effort to model it after C++ AMP, the API must be redesigned to avoid the pitfalls of C++ AMP, such as its inability to handle pointers (see http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/19/restrict-amp-restrictions-part-0-of-n-introduction.aspx). Currently, the implementation also relies heavily on the Visual Studio C++ compiler, which it should not. If you are interested in helping out, please let me (Ken Domino) know.

(Note: updated Feb 7, 2015 for API changes.)
