Answers
- Depending on your GPU, the exact values may vary, but for the vector addition problem, the dominant portion of the timeline is spent on data transfer between host and device, making it the main performance bottleneck.
- Because the simple vector addition kernel accesses memory in a contiguous, coalesced pattern, its global memory load efficiency is relatively high. In contrast, more complex kernels, such as matrix multiplication, may have less obvious coalescing.
- Using Fortran (
"F") order stores the 2D array in column-major order, aligning memory access across threads for coalescing, which improves global memory load efficiency.
Get this book's PDF version and more
Scan the QR code (or go to packtpub.com/unlock). Search for this book by name, confirm the edition, and then follow the steps on the page.


Note: Keep your invoice handy. Purchases made directly from Packt don't require an invoice.