Architecture and Design Considerations
This document is incomplete; more will be added when time permits. In the meantime, the following notes cover some general considerations.
What are the major components of Campy?
Campy is partitioned into these projects.
- Campy: the user API, containing Campy.Parallel.For().
- Campy.Compiler: the JIT compiler, including code to import the user's CIL assembly code. Right now, LLVM and CUDA calls are sprinkled throughout.
- Campy.Meta: a C# layer over the meta system that is written in Campy.Runtime. The meta system is based on the one provided in DotNetAnywhere, extensively updated for the additional table types in the PE file specified by ECMA-335. The meta system must run on both the GPU and the CPU: C# code on the GPU needs a meta system to keep track of references and reference types; C# code on the CPU needs to copy from the user’s data structures, in whatever framework is running, into the substituted runtime framework.
- Campy.Runtime: the GPU native code (C++/CUDA code which is run on the GPU), the CPU native code (the same C++/CUDA code compiled for the CPU with a very thin export layer to expose a few functions), and the C# framework for Campy (the C# code that replaces .NET Core/.NET Framework/Mono).
- Campy.Graphs: an implementation of graphs, containing only the essentials used by Campy. (It used to be much bigger.)
- Campy.Utils: odds and ends.
- Tests/: various unit tests for Campy, including reduction, scan, sort, etc.
- Swigged.CUDA: a separate NuGet package that exposes a C# API to the CUDA Driver API.
- Swigged.LLVM: a separate NuGet package that exposes a C# API to the LLVM-C API of LLVM. Separately, builds of LLVM are released for static linking at https://github.com/kaby76/llvm, since LLVM.org does not release any pre-built binaries of the core libraries.
What open source software does Campy use?
Any project, especially one developed by a single person, must rely on other work. Campy uses the following:
- C#, .NET Framework and .NET Core
Why does Campy use LLVM?
Campy uses LLVM for JIT compilation of CIL code. There are several alternatives, but none offers the same range of targets (both NVIDIA and AMD GPUs) or the same depth (optimizations and debugging additions). LLVM is also under active development by many people.
One alternative, which Campy itself used many years ago, is to translate CIL into CUDA/C++ code, which is then compiled. Cudafy, Altimesh, and others seem to take this approach. However, this requires the compiler toolchain to be installed.
Why use DotNetAnywhere instead of Mono or .NET Core or something else?
Campy requires a small .NET framework to support execution of CIL on a GPU. Unfortunately, CoreCLR, CoreRT, Mono, and the .NET Micro Framework (now nanoFramework) are large. In fact, I don’t want to supplant the framework that an app runs in. The hypothesis is that Campy can work with any framework by copying the data structures from whatever framework is in use into a substituted framework, for which only the very lowest-level classes are required to get the kernel to work on the GPU. The copying that Campy does is a deep copy of all data referenced in the closure object for the kernel.
DotNetAnywhere, with all its problems, was the best choice to get “up and running”, as I did not want to write an entire framework from scratch. The main problem in porting DotNetAnywhere to run on a GPU is that CUDA requires C/C++ functions to be declared with the __device__ modifier in order to run on a GPU. Campy requires two versions of the DotNetAnywhere-based framework: one for the CPU, for copying, and the other for the GPU, for C# managed object support.
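One common way to build the same C/C++ source for both the CPU and the GPU is a qualifier macro that expands to CUDA's function qualifiers under nvcc and to nothing under an ordinary C++ compiler. The following is a minimal sketch of that idiom, not the actual Campy.Runtime code; the macro name GPU_FN and the helper align_up are hypothetical.

```cpp
#include <cstddef>

// __CUDACC__ is defined by nvcc when it compiles a CUDA source file.
#ifdef __CUDACC__
#define GPU_FN __host__ __device__   // callable from both CPU and GPU code
#else
#define GPU_FN                       // plain C++ build: no qualifier needed
#endif

// Hypothetical runtime helper: rounds `size` up to a multiple of `align`,
// where `align` must be a power of two.
GPU_FN inline std::size_t align_up(std::size_t size, std::size_t align)
{
    return (size + align - 1) & ~(align - 1);
}
```

With this arrangement a single source file can be compiled once by nvcc for the GPU and once by a host compiler for the CPU, at the cost of annotating every function that the GPU build needs.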
Why does Campy need Mono.Cecil?
Although DotNetAnywhere includes a meta type system capable of reading any assembly, Mono.Cecil is used to read the CIL instructions of the methods in the user’s app. It is a well-developed API, and superior to the DotNetAnywhere meta system and anything else out there.
What API does Campy call for CUDA and LLVM?
Campy (written mostly in C#) interfaces with CUDA and LLVM, both of which are native libraries, through a SWIG-generated layer. Why SWIG? The CUDA API is well organized and contains over 397 functions, of which only about 20 are actually used in Campy. LLVM is likewise a very large API, of which Campy uses only a part. Although one could use C++ Interop (Implicit P/Invoke), I did not know in advance which functions would be used in Campy, so, to allow for maximal flexibility, the interface is SWIG-generated. Further, LLVM is not delivered as a library (DLL or SO files). Due to the unstable and kludgy nature of SWIG, it would probably be good to work out an alternative, or to use C++ Interop alone.
What space are GPU objects allocated in?
Campy copies all C# data structures used in the closure object of a Campy.Parallel.For() call into memory that the GPU can access. All objects are allocated in pinned memory on the CPU so they can be modified by both the CPU and the GPU. Eventually, I expect the API to change so that pure GPU memory can be used as an alternative, for performance reasons.
Campy does not seem to include any explicit API to copy data to/from the GPU, like cudaMemcpy. Why not?
As a design requirement, Campy does not expose things like cudaMemcpy, cudaMalloc, cudaFree, etc., because C# data objects are managed objects, whereas cudaMalloc is untyped block allocation. While an API could expose that functionality, it is not in keeping with managed languages. That said, Campy offers Sticky(), Readonly(), and Sync() to improve performance. Other mechanisms will be explored in time.
Campy doesn’t seem to include in the API a way to access GPU shared memory and thread cooperation. Why not?
Campy will be offering that at some point. I just haven’t had time yet to do that.
What is the runtime model?
Structures and classes in C# are compiled into LLVM structs, which are packed with explicit fields for padding to maintain 8-byte boundaries.
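As an illustration of that layout, a C# class with an int field followed by a long field could correspond to a packed struct in which the padding is spelled out as a field of its own. The sketch below expresses the idea in C++ rather than LLVM IR; the type and field names are hypothetical, and the exact layout Campy emits may differ.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical C# source:  class Point { int id; long value; }
#pragma pack(push, 1)          // packed: the compiler inserts no padding itself
struct PointLayout {
    std::int32_t id;           // offset 0
    std::uint8_t pad0[4];      // explicit padding so the next field starts at 8
    std::int64_t value;        // offset 8
};
#pragma pack(pop)

// Compile-time checks that the explicit padding gives 8-byte boundaries.
static_assert(offsetof(PointLayout, value) == 8, "value must start on an 8-byte boundary");
static_assert(sizeof(PointLayout) == 16, "total size must be a multiple of 8");
```

Making the padding explicit means the layout does not depend on each compiler's implicit alignment rules, which matters when the same object is read by code built for the CPU and for the GPU.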
Campy compiles all methods as call-by-value. Currently, all parameters to a method are copied into automatic variables on the stack at the beginning of the method. This allows the value to be modified via starg CIL instructions. Similarly, all local variables are allocated on the stack at the beginning of the method. Originally, Campy was written so that there were no local automatics, but that failed when compiling methods that contained ldarga and ldloca instructions, which take the address of an argument or local. Eventually, I’ll add an optimization to avoid the allocas.
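The effect of copying parameters into stack slots can be pictured in C++ terms. The sketch below is an analogy, not compiler output, and the function names are hypothetical. C++ parameters are already addressable, so the explicit copy only mirrors what the generated code does: the incoming value is stored in a slot whose address can be taken (as ldarga requires) and which can be overwritten (as starg requires) without affecting the caller.

```cpp
// Analogy for a C# method:
//   static int Bump(int x) { Inc(ref x); x = x * 2; return x; }

// Callee that receives the address of its caller's slot.
static void inc(int* p) { *p += 1; }

int bump(int x_in)
{
    int x_slot = x_in;     // parameter copied to a stack slot on entry (an alloca in LLVM terms)
    inc(&x_slot);          // ldarga: the slot has an address that can be passed on
    x_slot = x_slot * 2;   // starg: the slot is overwritten in place
    return x_slot;         // the caller's own variable is untouched (call-by-value)
}
```

When a method contains no ldarga, ldloca, or starg instructions, the slots are unnecessary, which is the case the planned optimization would target.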
Exceptions are not supported because CUDA does not support exceptions. For the moment, Campy generates code for the path that is exception-free.
Why aren’t you selling this?
Campy is not for sale. It is open source software. Although I am unemployed, haven’t had a regular programming job for many years, and have almost no money to live on, I can’t sell it: the software is too specialized, and there is other software that does something similar (Alea, Altimesh, and ILGPU).