Using CUDA streams to overlap operations
Chapter 5 introduced CUDA streams; now it is time to learn how to use them in practice. CUDA streams are a powerful feature of NVIDIA's CUDA programming model that let us execute multiple operations concurrently, increasing the throughput of GPU workloads. This may sound confusing at first: we know that GPUs have a large number of processing cores, and that is already why they execute instructions concurrently, right? That is part of the story, but there is another level of parallelism to explore. By properly leveraging streams, we can overlap memory transfers with kernel executions to maximize hardware utilization. This is possible because most consumer-grade GPUs have three asynchronous engines: one for kernel execution and two for memory transfers (one for host-to-device copies and one for device-to-host copies). We will discuss how to best utilize these resources given memory bandwidth limitations.
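The overlap described above is usually achieved by splitting a workload into chunks and issuing each chunk's host-to-device copy, kernel, and device-to-host copy into its own stream, so that copies in one stream run concurrently with kernels in another. Here is a minimal sketch of that pattern; the `scale` kernel, the chunk count, and the sizes are illustrative assumptions, not taken from the chapter. Note that pinned (page-locked) host memory, allocated with `cudaMallocHost`, is required for copies to be truly asynchronous.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative kernel (an assumption for this sketch): scales each element in place.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int N = 1 << 20;          // total elements
    const int CHUNKS = 4;           // one stream per chunk
    const int CHUNK = N / CHUNKS;   // elements per chunk
    const size_t BYTES = CHUNK * sizeof(float);

    float *h_data, *d_data;
    // Pinned host memory: without it, cudaMemcpyAsync falls back to
    // synchronous behavior and no overlap occurs.
    cudaMallocHost(&h_data, N * sizeof(float));
    cudaMalloc(&d_data, N * sizeof(float));
    for (int i = 0; i < N; ++i) h_data[i] = 1.0f;

    cudaStream_t streams[CHUNKS];
    for (int s = 0; s < CHUNKS; ++s) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk in, processes it, and copies it back.
    // The H2D copy of chunk s+1 can overlap the kernel of chunk s, which
    // can overlap the D2H copy of chunk s-1, keeping all three engines busy.
    for (int s = 0; s < CHUNKS; ++s) {
        int off = s * CHUNK;
        cudaMemcpyAsync(d_data + off, h_data + off, BYTES,
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(CHUNK + 255) / 256, 256, 0, streams[s]>>>(d_data + off,
                                                           CHUNK, 2.0f);
        cudaMemcpyAsync(h_data + off, d_data + off, BYTES,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();  // wait for all streams to finish

    printf("h_data[0] = %f\n", h_data[0]);

    for (int s = 0; s < CHUNKS; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}
```

Operations within a single stream still execute in issue order, so each chunk's copy-kernel-copy dependency chain is preserved automatically; only operations in different streams may overlap.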
...