Campy.NET has passed a milestone: it can now approximate the value of pi. The code, below, is an implementation of the classic toy example “summing a circle’s area”.

using Campy; using Campy.Types; using System; namespace Test { class Program { class Point { public float x; public float y; } static void Main(string[] args) { int half_size = 1000; int size = half_size * half_size; Point[] data = new Point[size]; for (int i = 0; i < size; ++i) data[i] = new Point(); Array_View<Point> points = new Array_View<Point>(ref data); Extent e = new Extent(size); AMP.Parallel_For_Each(e, (Index idx) => { int i = idx[0]; points[i].x = (float)(1.0 * (i / half_size) / half_size); points[i].y = (float)(1.0 * (i % half_size) / half_size); }); points.Synchronize(); int[] insc = new int[size]; Array_View<int> ins = new Array_View<int>(ref insc); //ins.discard_data(); AMP.Parallel_For_Each(e, (Index idx) => { int i = idx[0]; float radius = 1.0f; float tx = points[i].x; float ty = points[i].y; float t = (float)Math.Sqrt(tx*tx + ty*ty); ins[i] = (t <= radius) ? 1 : 0; }); Extent e_half = new Extent(half_size); int[] count = new int[1]; count[0] = 0; Array_View<int> res = new Array_View<int>(ref count); AMP.Parallel_For_Each(e_half, (Index idx) => { int i = idx[0]; for (int j = 1; j < half_size; ++j) { int k = i * half_size; int t1 = ins[k + j]; int t2 = ins[k]; int t3 = t1 + t2; ins[k] = t3; // cannot decompile!!! ins[i * half_size] += ins[i * half_size + j]; } AMP.Atomic_Fetch_Add(ref res, 0, ins[i * half_size]); }); int cou = res[0]; System.Console.WriteLine("Count is " + cou + " out of " + size); float pi = (4.0f * cou) / size; System.Console.WriteLine("Pi is " + pi); } } }

Notes:

Although the example runs only slightly faster than the CPU equivalent implementation, there are several problems. First, unnecessary memory copy at line 21 occurs between CPU and GPU because the equivalent to *discard_data* has not yet been implemented. In addition, the last kernel uses an atomic operation to increment the count of the points within the circle (line 58). Converting the code to use the Blelloch method of reduction, and removing the unnecessary array “points” would improve the performance probably an order of magnitude or two. Alternatively, we could count the number of point *outside* the circle, then compute pi using the difference with the total number of points. In theory, this would result in less contention because there would be fewer points counted. But, this is the joy of GPU programming.

Also, in the last kernel, the ILSpy decompiler, which Campy.NET uses, has a notable problem with the code. The generated C++ AMP code cannot compile. The equivalent code is provided.