Streams, Events, Contexts, and Concurrency

In the prior chapters, we saw that there are two primary operations we perform from the host when interacting with the GPU:

  • Copying memory data to and from the GPU
  • Launching kernel functions

We know that within a single kernel, there is one level of concurrency among its many threads; however, there is another level of concurrency available to us across multiple kernels and GPU memory operations. This means that we can launch multiple memory and kernel operations at once, without waiting for each operation to finish before starting the next. On the other hand, we have to stay organized enough to ensure that all inter-dependent operations are synchronized: we shouldn't launch a particular kernel until its input data has been fully copied to device memory, and we shouldn't copy the output data of a launched kernel back to the host until the kernel has finished executing.
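To make this idea concrete before we look at each mechanism in detail, here is a minimal sketch (our own illustration, not the chapter's example code) of queuing an independent copy-kernel-copy sequence on each of two CUDA streams with PyCUDA; the kernel and the array sizes are arbitrary placeholders:

# A hedged sketch of two independent streams of work; the kernel and sizes
# are placeholders chosen for illustration only.
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda import gpuarray
from pycuda.compiler import SourceModule

ker = SourceModule("""
__global__ void square(float *a)
{
    int i = threadIdx.x;
    a[i] *= a[i];
}
""")
square = ker.get_function("square")

streams = [drv.Stream() for _ in range(2)]
host_data = [np.float32(np.random.rand(64)) for _ in range(2)]
gpu_data = []

# Queue a copy and a kernel launch on each stream; none of these calls
# block the host, so the two sequences are free to overlap on the device.
for data, stream in zip(host_data, streams):
    d = gpuarray.to_gpu_async(data, stream=stream)
    square(d, block=(64, 1, 1), grid=(1, 1, 1), stream=stream)
    gpu_data.append(d)

# Copy the results back asynchronously, then synchronize before using them.
results = [d.get_async(stream=s) for d, s in zip(gpu_data, streams)]
for s in streams:
    s.synchronize()

print([np.allclose(r, h * h) for r, h in zip(results, host_data)])

Because none of the queued calls block the host, the two sequences can overlap on the device; the explicit synchronize calls at the end are what guarantee the results are ready before we read them.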

Technical requirements

A Linux or Windows 10 PC with a modern NVIDIA GPU (2016 onward) is required for this chapter, with all necessary GPU drivers and the CUDA Toolkit (version 9.0 onward) installed. A suitable Python 2.7 installation (such as Anaconda Python 2.7) with the PyCUDA module is also required.

This chapter's code is also available on GitHub:

https://github.com/PacktPublishing/Hands-On-GPU-Programming-with-Python-and-CUDA

For more information about the prerequisites, check the Preface of this book; for the software and hardware requirements, check the README at https://github.com/PacktPublishing/Hands-On-GPU-Programming-with-Python-and-CUDA.

CUDA device synchronization

Before we can use CUDA streams, we need to understand the notion of device synchronization. This is an operation where the host blocks any further execution until all operations issued to the GPU (memory transfers and kernel executions) have completed. This is required to ensure that operations dependent on prior operations are not executed out-of-order—for example, to ensure that a CUDA kernel launch is completed before the host tries to read its output.

In CUDA C, device synchronization is performed with the cudaDeviceSynchronize function. This function blocks further execution on the host until all GPU operations have completed. cudaDeviceSynchronize is so fundamental that it is usually one of the very first topics covered in most books on CUDA C; we haven't seen it yet because PyCUDA has been invisibly performing this synchronization for us where necessary, such as when we copy results back from the GPU with gpuarray.get.
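As a rough illustration (a sketch of our own, not the chapter's listing), the driver-API counterpart of cudaDeviceSynchronize is exposed in PyCUDA as pycuda.driver.Context.synchronize; the kernel below is a placeholder:

# A hedged sketch (placeholder kernel, not the book's example) of explicit
# device synchronization from PyCUDA via the driver API's context synchronize.
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda import gpuarray
from pycuda.compiler import SourceModule

ker = SourceModule("""
__global__ void double_elements(float *a)
{
    int i = threadIdx.x;
    a[i] *= 2.0f;
}
""")
double_elements = ker.get_function("double_elements")

a = np.float32(np.random.rand(64))
a_gpu = gpuarray.to_gpu(a)                                # host-to-device copy
double_elements(a_gpu, block=(64, 1, 1), grid=(1, 1, 1))  # kernel launch returns immediately
drv.Context.synchronize()                                 # block the host until all GPU work is done
print(np.allclose(a_gpu.get(), a * 2))                    # now safe to read the output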

Events

Events are objects that exist on the GPU, whose purpose is to act as milestones or progress markers for a stream of operations. Events are generally used to precisely measure durations on the device side; the timing we have done so far has relied on host-based Python profilers and standard Python library functions such as time. Additionally, events can be used to give the host a status update on the state of a stream and which operations it has already completed, as well as for explicit stream-based synchronization.

Let's start with an example that uses no explicit streams, and uses events to measure only a single kernel launch. (If we don't explicitly use streams in our code, CUDA invisibly defines a default stream into which all operations are placed.)
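The following is a hedged sketch of what such event-based timing looks like in PyCUDA (the kernel here is a stand-in of our own, not the chapter's listing):

# A hedged sketch of timing a single kernel launch on the default stream with
# CUDA events in PyCUDA; the kernel is a placeholder of our own choosing.
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda import gpuarray
from pycuda.compiler import SourceModule

ker = SourceModule("""
__global__ void scale(float *a, float s)
{
    int i = threadIdx.x;
    a[i] *= s;
}
""")
scale = ker.get_function("scale")

a_gpu = gpuarray.to_gpu(np.float32(np.random.rand(64)))

start = drv.Event()
end = drv.Event()

start.record()                # milestone placed in the default stream
scale(a_gpu, np.float32(3.0), block=(64, 1, 1), grid=(1, 1, 1))
end.record()                  # milestone placed right after the kernel launch
end.synchronize()             # wait for the GPU to actually reach this milestone

print('Has the end event occurred yet? %s' % end.query())
print('Kernel duration: %f milliseconds.' % start.time_till(end))

Note that end.query() only returns True once the GPU has actually passed the recorded milestone, which is exactly the kind of status update described above.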

Here, we will use the same useless...

Contexts

A CUDA context is usually described as being analogous to a process in an operating system. Let's review what this means—a process is an instance of a single program running on a computer; all programs outside of the operating system kernel run in a process. Each process has its own set of instructions, variables, and allocated memory, and is, generally speaking, blind to the actions and memory of other processes. When a process ends, the operating system kernel performs a cleanup, ensuring that all memory that the process allocated has been de-allocated, and closing any files, network connections, or other resources the process has made use of. (Curious Linux users can view the processes running on their computer with the command-line top command, while Windows users can view them with the Windows Task Manager).

Similar to a process, a context is associated with a single host program that is using the GPU; the context keeps track of all the CUDA kernels that have been compiled and the device memory that has been allocated on its behalf.
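As a small illustration (an assumed usage sketch, not the chapter's code), we can inspect and synchronize the context that pycuda.autoinit has been silently creating for us:

# A minimal sketch of inspecting and synchronizing the current CUDA context;
# pycuda.autoinit creates and activates this context for us behind the scenes.
import pycuda.autoinit
import pycuda.driver as drv

ctx = pycuda.autoinit.context            # the context autoinit created
print(pycuda.autoinit.device.name())     # the GPU this context is bound to
ctx.synchronize()                        # block until all work in this context finishes

# Contexts can also be created and torn down explicitly, without autoinit:
#   drv.init()
#   dev = drv.Device(0)
#   ctx = dev.make_context()   # create and activate a new context
#   ...                        # issue GPU work here
#   ctx.pop()                  # deactivate the context when done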

Summary

We started this chapter by learning about device synchronization and the importance of synchronizing GPU operations from the host; this ensures that dependent operations wait for antecedent operations to finish before proceeding. This concept has been hidden from us so far, as PyCUDA has been handling synchronization for us automatically. We then learned about CUDA streams, which allow independent sequences of operations to execute on the GPU simultaneously without synchronizing across the entire GPU, which can give us a big performance boost. We also learned about CUDA events, which allow us to time individual CUDA kernels within a given stream and to determine whether a particular operation in a stream has occurred. Finally, we learned about contexts, which are analogous to processes in a host operating system, and how to synchronize across an entire context.

Questions

  1. In the launch parameters for the kernels in the first example, each kernel was launched over 64 threads. If we increase the number of threads to or beyond the number of cores in our GPU, how does this affect the performance of both the original version and the stream version?
  2. Consider the CUDA C example that was given at the very beginning of this chapter, which illustrated the use of cudaDeviceSynchronize. Do you think it is possible to get some level of concurrency among multiple kernels without using streams and only using cudaDeviceSynchronize?
  3. If you are a Linux user, modify the last example that was given to operate over processes rather than threads.
  4. Consider the multi-kernel_events.py program; we said it is good that there was a low standard deviation of kernel execution durations. Why would it be bad if there were a high standard deviation?
  5. We only used 10 host...