Writing algorithms for GPU using CUDA C++
One of the hardest parts of learning to program GPUs is the massively parallel model of computation, which is radically different from the programming we're used to on CPUs. We have to think about the distribution of work up front, and about how to organize data movement between the different elements of a kernel. Once we get used to these concepts, some of the quirks of the architecture start to become apparent (usually once we start profiling our kernels). This is not going to be a comprehensive introduction to CUDA programming, but it should be enough to get you started.
The first thing we want to do is define a kernel. A CUDA kernel is defined by applying the __global__ attribute to a function that returns void. There are, of course, restrictions on the arguments that can be passed to a kernel (since they must be copied over to the GPU). We cannot pass data by reference, and any pointer values we pass should point to memory the GPU can access, typically device memory allocated with cudaMalloc, rather than ordinary host memory.
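Here is a minimal sketch of what this looks like in practice. The kernel name (scale) and the launch configuration are just illustrative choices; the point is the __global__ attribute, the void return type, and the fact that the pointer argument refers to device memory:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// A minimal kernel: scales each element of an array in place.
// __global__ marks it as callable from the host; it must return void.
__global__ void scale(float *data, int n, float factor) {
    // Each thread computes a unique global index from its block and
    // thread coordinates, then handles at most one element.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main() {
    const int n = 1024;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    // The pointer we pass to the kernel must refer to device memory,
    // so we allocate on the GPU and copy the input over.
    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch with enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(dev, n, 2.0f);

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[0] = %f\n", host[0]); // expect 2.0
    return 0;
}
```

The <<<blocks, threads>>> syntax is how the host specifies the launch configuration: how many blocks to run and how many threads per block. Note that the block count is rounded up so that every element is covered, which is why the kernel guards against indices past the end of the array.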