Asynchronous data transfers
As we saw in Chapter 4, when copying data to and from the device, we need to specify the direction in which we are moving the data. In our first examples, however, we waited for the entire dataset to be copied before we started to process it. After processing, we waited again for the results to be copied back to the host, where our main program could use them. When we accounted for the time needed to move the data, we saw that the total runtime increased significantly. Luckily, there is something we can do to improve on this.
Actually, there are two things that work together to improve performance: asynchronous data transfers and streams. We will look at asynchronous data transfers first.
Being asynchronous means that once the data transfer starts, control returns immediately to the CPU, which is then free to run other code, for example to gather another part of our large dataset. The function that we use for this is cudaMemcpyAsync, which conveniently is named in a way that...
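To make the idea of control returning immediately to the CPU concrete, here is a minimal sketch rather than a definitive implementation. It assumes the host buffer is allocated as pinned (page-locked) memory with cudaMallocHost, since transfers from ordinary pageable memory may not overlap with host execution. It also passes 0 as the stream argument, so all work goes to the default stream. The names scale, run_async_copy_demo, h_next and the CHECK macro are hypothetical and chosen only for this illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical error-checking macro: prints the CUDA error and makes the
// enclosing function return -1. For brevity, buffers are not released on
// this error path.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s\n",                        \
                    cudaGetErrorString(err_));                         \
            return -1;                                                 \
        }                                                              \
    } while (0)

// Hypothetical kernel, for illustration only: doubles each element in place.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: returns the number of incorrect results,
// or -1 if a CUDA call fails.
int run_async_copy_demo(int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *h_data = NULL;          // pinned host buffer used for the transfers
    float *d_data = NULL;          // device buffer
    std::vector<float> h_next(n);  // ordinary host buffer filled by the CPU

    CHECK(cudaMallocHost((void **)&h_data, bytes));
    CHECK(cudaMalloc((void **)&d_data, bytes));
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    // These calls only queue work on the default stream (stream 0).
    // The copy, the kernel and the copy back run in that order on the device.
    CHECK(cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, 0));
    scale<<<(n + 255) / 256, 256>>>(d_data, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, 0));

    // The CPU is free to do other work here, such as gathering the next
    // part of the dataset. It must not touch h_data until the sync below.
    for (int i = 0; i < n; ++i) h_next[i] = (float)(n + i);

    // Wait for all queued device work before reading the results.
    CHECK(cudaDeviceSynchronize());

    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (h_data[i] != 2.0f * (float)i) ++errors;

    CHECK(cudaFree(d_data));
    CHECK(cudaFreeHost(h_data));
    return errors;
}

int main()
{
    int errors = run_async_copy_demo(1 << 16);
    printf("mismatches: %d\n", errors);
    return errors == 0 ? 0 : 1;
}
```

The point to notice is the explicit cudaDeviceSynchronize call: because the copies no longer block the host, the program itself has to wait before it reads h_data, otherwise it could see results that have not arrived yet.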