Considering other algorithms
Before we close this chapter, let’s look at some interesting results from other algorithms that might seem to be better alternatives to our first, naive implementation. Recalling the discussion of coalesced memory access from Chapter 6, it may seem natural to suppose that we would obtain higher performance by changing our kernel so that each CUDA thread processes an entire row of the matrix. Conversely, we might expect that having each CUDA thread compute an entire column would yield the worst possible performance. Let’s see the code for both kernels.
__global__ void matrixMulKernel_row(float *A, float *B, float *C,
                                    int width) {
    // Each thread computes one entire row of the output matrix C.
    int row = threadIdx.x + blockIdx.x * blockDim.x;
    if (row < width) {
        for (int col = 0; col < width; col++) {
            float sum = 0.0f;
            // Dot product of row `row` of A with column `col` of B.
            for (int i = 0; i < width; i++) {
                sum += A[row * width + i] * B[i * width + col];
            }
            C[row * width + col] = sum;
        }
    }
}
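
The column-per-thread variant follows the same pattern. The sketch below is a hedged reconstruction: the kernel name matrixMulKernel_col and its signature are assumptions, chosen to mirror the row kernel above.

__global__ void matrixMulKernel_col(float *A, float *B, float *C,
                                    int width) {
    // Each thread computes one entire column of the output matrix C.
    // (Hypothetical kernel name, mirroring matrixMulKernel_row.)
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    if (col < width) {
        for (int row = 0; row < width; row++) {
            float sum = 0.0f;
            for (int i = 0; i < width; i++) {
                sum += A[row * width + i] * B[i * width + col];
            }
            C[row * width + col] = sum;
        }
    }
}

Note that either kernel needs only width threads in total, one per row or per column, so a one-dimensional launch such as matrixMulKernel_row<<<(width + 255) / 256, 256>>>(A, B, C, width), with an assumed block size of 256, is enough to cover the whole matrix.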