Up to this point, I had envisioned Campy as an API similar to C++ AMP. However, I am now thinking beyond that API, looking for a simpler interface between CPU and GPU for C# and .NET code than the C++ AMP model.
In particular, C++ AMP defines array<> and array_view<>, data types that abstract an array for the GPU. What I would really prefer is to be able to run C# code on a GPU with no markup/tags/wrappers/etc., and to access all relevant C# data through the closure of the code: every object it uses, including classes, structs, arrays, and basic value types, available within the GPU. In other words, almost zero boilerplate would be involved in performing parallel computations on the GPU. However, this is easier said than done.
Take, for example, reduction (1). Here is a simple, clean, in-place implementation of reduction over an array of integers with the binary operator '+', written in a new, experimental version of Campy.
int n = Bithacks.Power2(20);
int[] data = new int[n];
Extent e = new Extent(n);

// Initialize the array in parallel.
Campy.Parallel.For(e, idx => data[idx] = 1);

// In-place pairwise reduction: after the last level, data[0] holds the sum.
for (int level = 1; level <= Bithacks.Log2(n); level++)
{
    int step = Bithacks.Power2(level);
    Campy.Parallel.For(new Extent(n / step), idx =>
    {
        var i = step * idx;
        data[i] = data[i] + data[i + step / 2];
    });
}

for (int i = 0; i < data.Length; ++i)
    System.Console.WriteLine(data[i]);
In this code, the first parallel for-loop initializes data; the second, executed once per level, sums pairs of elements using data and step. The variables data and step are shared between the CPU and GPU. At runtime, a delegate is created from the lambda expression passed to the Parallel.For() method. The delegate contains a target object whose fields are the variables captured by the lambda, i.e., the closure (2). Implicit in this code is a SIMT computing model with shared memory, so data structures can be shared. The final result of the sum is contained in data[0].
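To make the closure concrete, here is a sketch of roughly what the C# compiler generates for the second lambda. The names DisplayClass and Kernel are hypothetical (the real ones are compiler-mangled), but the shape is what must be traversed at runtime:

class DisplayClass
{
    public int[] data;   // captured array, shared with the CPU
    public int step;     // captured local for the current level

    // The lambda body becomes a method on the display class; the
    // delegate's Target property points at an instance of this class.
    public void Kernel(int idx)
    {
        var i = step * idx;
        data[i] = data[i] + data[i + step / 2];
    }
}

A JIT such as Campy's can therefore discover the closure environment by reflecting over the delegate's Target object and its fields.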
This example illustrates several problems. The implementation of C# data structures can vary across implementations of the .NET runtime, so the GPU code must either be aware of the particular implementation or marshal the objects into a representation appropriate for the GPU. Many operations on C# data structures call into the .NET runtime, so the GPU must also be able to JIT the .NET runtime itself. Sharing data further raises alignment issues; for example, a 64-bit fetch must be aligned on a 64-bit boundary (3, 4, 5). Finally, C# objects are allocated by the runtime's memory manager, over which we have no control (6): it allocates objects from a heap and may garbage-collect stale objects at any time.
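As a small illustration of the alignment point, consider a struct that mixes field sizes. The type Mixed is hypothetical; the printed values show the default marshaled (sequential) layout on a typical 64-bit runtime:

using System;
using System.Runtime.InteropServices;

struct Mixed
{
    public byte Flag;    // 1 byte
    public long Value;   // 8 bytes; 7 bytes of padding precede it
}

class LayoutDemo
{
    static void Main()
    {
        // Marshal reports the unmanaged layout; the managed layout is an
        // implementation detail the runtime is free to choose differently.
        Console.WriteLine(Marshal.SizeOf(typeof(Mixed)));             // 16, not 9
        Console.WriteLine(Marshal.OffsetOf(typeof(Mixed), "Value"));  // 8
    }
}

A GPU-side representation must agree with whichever layout the runtime actually uses, or the data must be marshaled between the two.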
As a first step toward sharing data structures, let us assume we are accessing only value types: user-defined arrays and structs that contain only other value types, such as integers, characters, booleans, arrays, and structs. This limits the difficulties associated with JITing the .NET runtime. Further, since we do not have access to the memory allocator/garbage collector, let us also assume the CIL contains no calls into the memory system. ILGPU and Alea GPU assume the same restrictions, as illustrated below.
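A hypothetical illustration of what these restrictions admit and exclude (the Particle and Node types are mine, not part of Campy):

struct Particle { public float X, Y; }   // only value-type fields: acceptable

Particle[] ps = new Particle[1024];
Campy.Parallel.For(new Extent(ps.Length), i => ps[i].X += 1.0f);   // OK

class Node { public int Item; public Node Next; }   // reference type: not handled

// List<int>.Add may allocate internally, calling into the memory
// system, so it is excluded under these restrictions as well.
var xs = new System.Collections.Generic.List<int>();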
With these limitations, a simple solution I am working toward is to convert C# data structures to equivalent blittable types (7, 8), then deep copy them to and from GPU memory. Unfortunately, this means a fair amount of copying to and from unmanaged memory, currently allocated via cuMemAllocManaged. Remember, reference types, which are what most programmers use, are not handled. A simple example with Campy restricted to value types is available (9).
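Here is a minimal sketch of the copy-to-unmanaged step, assuming a blittable element type. Marshal.AllocHGlobal stands in for a P/Invoke binding to cuMemAllocManaged, and this is an illustration, not Campy's actual implementation:

using System;
using System.Runtime.InteropServices;

static class BlitCopy
{
    // Deep copy of a blittable struct array into unmanaged memory.
    // Compile with /unsafe. The copy back to managed memory is symmetric.
    public static unsafe IntPtr ToUnmanaged<T>(T[] src) where T : struct
    {
        long size = (long)Marshal.SizeOf(typeof(T)) * src.Length;
        IntPtr dst = Marshal.AllocHGlobal((IntPtr)size); // stand-in for cuMemAllocManaged
        GCHandle h = GCHandle.Alloc(src, GCHandleType.Pinned); // requires blittable T
        try
        {
            Buffer.MemoryCopy((void*)h.AddrOfPinnedObject(), (void*)dst, size, size);
        }
        finally
        {
            h.Free();
        }
        return dst;
    }
}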
Alea GPU Version 3 provides an interface very similar to Campy's, in which kernel delegates are passed to a Parallel.For()-like routine for JITing and running. In Version 3, the [GpuManaged] attribute alleviates the issues of synchronizing data between CPU and GPU. ILGPU Version 0.1.4 does not handle reference types (i.e., classes).
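For comparison, a kernel in Alea GPU v3 looks roughly like the following; the details are recalled from its samples, so treat them as approximate:

using Alea;
using Alea.Parallel;

class AleaExample
{
    [GpuManaged]   // lets Alea copy captured managed data to and from the GPU
    public static void AddOne(int[] data)
    {
        var gpu = Gpu.Default;
        gpu.For(0, data.Length, i => data[i] += 1);
    }
}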
–Ken Domino
Notes and References
(1) Reduction
- REDUCE(⊕, A): The REDUCE operation takes a binary operator ⊕ with identity I, and an ordered set A = [a₀, a₁, …, aₙ₋₁] of n elements. The value of REDUCE(⊕, A) is (((a₀ ⊕ a₁) ⊕ …) ⊕ aₙ₋₁); for an empty A it is the identity I.
- Since addition is commutative and associative for integers, REDUCE can be computed in parallel. One implementation is an in-place method that modifies the input set: an inner for-loop nested within an outer for-loop, as in the Campy code described above. The iterations of the inner for-loop are independent, so the pairwise sums at each level can be performed in parallel, as sketched below.
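A sequential rendering of the same in-place scheme, in plain C# without Campy, showing which loop is the one that parallelizes; after the outer loop completes, data[0] holds the result:

// Assumes n is a power of two; Log2 mirrors the Bithacks helper above.
for (int level = 1; level <= Log2(n); level++)
{
    int step = 1 << level;                      // Bithacks.Power2(level)
    for (int idx = 0; idx < n / step; idx++)    // this loop runs in parallel on the GPU
    {
        int i = step * idx;
        data[i] = data[i] + data[i + step / 2];
    }
}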
(2) Lambda Expressions (C# Programming Guide). Microsoft Documentation. https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/statements-expressions-operators/lambda-expressions
(3) Coding for Performance: Data alignment and structures. https://software.intel.com/en-us/articles/coding-for-performance-data-alignment-and-structures
(4) Data Alignment when Migrating to 64-Bit Intel® Architecture. https://software.intel.com/en-us/articles/data-alignment-when-migrating-to-64-bit-intel-architecture
(5) How Memory Is Accessed. https://software.intel.com/en-us/articles/how-memory-is-accessed
(6) GitHub issue dotnet/coreclr#1235. https://github.com/dotnet/coreclr/issues/1235
(7) Mehram, P. Managed code and unmanaged code in .NET. C# Corner. http://www.c-sharpcorner.com/uploadfile/puranindia/managed-code-and-unmanaged-code-in-net/
(8) Parsons, J. Changes to Blittable Types. GitHub Gist. https://gist.github.com/jaredpar/cecc2f5fd76b70e480450296d4c9914e Accessed Aug 29, 2017.
(9) You can examine and try the working test on GitHub; note that it does not yet run the reduction example due to alignment problems.
# git clone https://github.com/kaby76/Campy.git
# cd Campy; git checkout proof-of-concept