Performance Strategies
We have learned many concepts and techniques for creating CUDA programs, and we have already seen some performance improvements in our small examples. However, simply converting a program to run on the GPU may not yield the desired speed-up, and it can introduce costs that offset the expected gains. One example is the time needed to copy data from host memory to GPU memory. With the right strategies, you can overcome these problems.
In this chapter, we revisit our old friend, matrix multiplication, and discuss some ways to improve its performance. Most importantly, we'll investigate why each change we make improves performance. Along the way we shall cover the following topics:
- Introducing optimization
- Profiling with NVIDIA Nsight Compute