Hands-On GPU Programming with Python and CUDA

Assessment

Chapter 1, Why GPU Programming?

  1. The first two for loops iterate over every pixel, and each pixel's output is independent of all the others; we can therefore parallelize over these two for loops. The third for loop calculates the final value of a particular pixel, which is an intrinsically recursive (serial) computation.
  2. Amdahl's Law doesn't account for the time it takes to transfer memory between the GPU and the host.
  3. 512 x 512 amounts to 262,144 pixels. This means that the first GPU can only calculate the outputs of half of the pixels at once, while the second GPU can calculate all of the pixels at once; this means the second GPU will be about twice as fast as the first here. The third GPU has more than sufficient cores to calculate all pixels at once, but as we saw in problem 1, the extra cores will be of no use to us here. So the second and third GPUs will be equally fast for this problem.
  4. One issue with generically...

Chapter 2, Setting Up Your GPU Programming Environment

  1. No. CUDA only supports Nvidia GPUs, not Intel HD Graphics or AMD Radeon GPUs.
  2. The examples in this book only use Python 2.7.
  3. Device Manager
  4. lspci
  5. free
  6. .run

Chapter 3, Getting Started with PyCUDA

  1. Yes.
  2. Memory transfers between host/device, and compilation time.
  3. You can, but this will vary depending on your GPU and CPU setup.
  4. Do this using the C ternary operator (?:) for both the point-wise and reduce operations.
  5. If a gpuarray object goes out of scope its destructor is called, which will deallocate (free) the memory it represents on the GPU automatically.
  6. ReductionKernel may perform superfluous operations, which may be necessary depending on how the underlying GPU code is structured. A neutral element will ensure that no values are altered as a result of these superfluous operations.
  7. We should set neutral to the smallest possible value of a signed 32-bit integer (see the sketch after this list).
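
A minimal PyCUDA sketch of answers 4 and 7 (the kernel name and test data below are made up for illustration): a max-reduction built with ReductionKernel, using the C ternary operator as the reduce expression and the smallest signed 32-bit integer as the neutral element, so that superfluous reduction steps can never alter the result.

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.reduction import ReductionKernel

    # Max-reduction over signed 32-bit integers. The neutral element is INT_MIN
    # (written as -2147483647 - 1 to avoid a literal-overflow warning), so any
    # superfluous reduction steps leave the final maximum unchanged.
    max_kernel = ReductionKernel(np.int32, neutral="-2147483647 - 1",
                                 reduce_expr="a > b ? a : b",
                                 map_expr="in[i]",
                                 arguments="int *in")

    x = gpuarray.to_gpu(np.random.randint(-1000, 1000, 100000).astype(np.int32))
    print(max_kernel(x).get())   # should agree with x.get().max()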

Chapter 4, Kernels, Threads, Blocks, and Grids

  1. Try it.
  2. Not all of the threads operate on the GPU simultaneously. Much like a CPU switching between tasks in an OS, the individual cores of the GPU switch between the different threads for a kernel.
  3. O((n/640) log n), that is, O(n log n).
  4. Try it.

  1. There is actually no internal grid-level synchronization in CUDA—only block-level (with __syncthreads). We have to synchronize anything above a single block with the host.
  2. Naive: 129 addition operations. Work-efficient: 62 addition operations.
  3. Again, we can't use __syncthreads if we need to synchronize over a large grid of blocks. If we instead synchronize on the host, we can also launch fewer threads on each iteration, freeing up more resources for other operations (a host-synchronized sketch follows this list).
  4. In the case of a naive parallel sum, we will likely be working with only a small number of data points that...
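
A rough sketch of host-side grid synchronization (answers 1 and 3): the array-summing kernel below is hypothetical, but it shows launching a separate kernel per reduction pass, with fewer threads on each pass, and synchronizing on the host between launches since CUDA offers no grid-wide __syncthreads.

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void sum_pass(float *x, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        x[i] += x[i + stride];
    }
    """)
    sum_pass = mod.get_function("sum_pass")

    x = gpuarray.to_gpu(np.ones(1024, dtype=np.float32))
    stride = 512
    while stride >= 1:
        # each pass needs only 'stride' threads
        sum_pass(x, np.int32(stride),
                 block=(min(stride, 64), 1, 1),
                 grid=(max(stride // 64, 1), 1))
        # synchronize on the host before the next pass reads these results
        pycuda.autoinit.context.synchronize()
        stride //= 2
    print(x.get()[0])   # 1024.0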

Chapter 5, Streams, Events, Contexts, and Concurrency

  1. The performance improves for both; as we increase the number of threads, the GPU reaches peak utilization in both cases, reducing the gains made through using streams.
  2. Yes, you can launch an arbitrary number of kernels asynchronously and synchronize them all with cudaDeviceSynchronize (see the sketch after this list).
  3. Open up your text editor and try it!
  4. High standard deviation would mean that the GPU is being used unevenly, overwhelming the GPU at some points and under-utilizing it at others. A low standard deviation would mean that all launched operations are running generally smoothly.
  5. i. The host can generally handle far fewer concurrent threads than a GPU. ii. Each thread requires its own CUDA context. The GPU can become overwhelmed with excessive contexts, since each has its own memory space and has to handle its own loaded executable code.
...
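
A short sketch of answer 2, assuming PyCUDA (the kernel and array sizes are arbitrary): several kernels are launched asynchronously, each on its own stream, and then everything is synchronized at once with a device-wide synchronization; PyCUDA's context.synchronize() plays the role of cudaDeviceSynchronize here.

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void double_kernel(float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        x[i] *= 2.0f;
    }
    """)
    double_kernel = mod.get_function("double_kernel")

    streams = [drv.Stream() for _ in range(4)]
    arrays = [gpuarray.to_gpu_async(np.random.rand(1024).astype(np.float32), stream=s)
              for s in streams]

    # launch one kernel asynchronously on each stream
    for arr, s in zip(arrays, streams):
        double_kernel(arr, block=(64, 1, 1), grid=(16, 1), stream=s)

    # one device-wide synchronization for all of the launches above
    pycuda.autoinit.context.synchronize()
    print(arrays[0].get()[:4])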

Chapter 6, Debugging and Profiling Your CUDA Code

  1. Memory allocations are automatically synchronized in CUDA.
  2. The lockstep property only holds within a single warp, that is, in blocks of size 32 or less. Here, the two blocks would properly diverge without any lockstep.
  3. The same thing would happen here. This 64-thread block would actually be split into two 32-thread warps.
  4. Nvprof can time individual kernel launches, GPU utilization, and stream usage; any host-side profiler would only see CUDA host functions being launched.
  5. printf is generally easier to use for small-scale projects with relatively short, inline kernels (see the sketch after this list). If you write a very involved CUDA kernel with thousands of lines, then you would probably want to use the IDE to step through and debug your kernel line by line.
  6. This tells CUDA which GPU we want to use.
  7. cudaDeviceSynchronize will ensure that interdependent kernel launches and mem copies...
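
A small sketch of answer 5 (the kernel below is made up): device-side printf is often the quickest way to debug a short, inline kernel from PyCUDA.

    import pycuda.autoinit
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    #include <stdio.h>
    __global__ void hello_kernel()
    {
        printf("Hello from thread %d of block %d\\n", threadIdx.x, blockIdx.x);
    }
    """)
    hello_kernel = mod.get_function("hello_kernel")
    hello_kernel(block=(4, 1, 1), grid=(2, 1))
    # synchronizing flushes the device-side printf buffer to the console
    pycuda.autoinit.context.synchronize()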

Chapter 7, Using the CUDA Libraries with Scikit-CUDA

  1. SBLAH starts with an S, so this function uses 32-bit real floats. ZBLEH starts with a Z, which means it works with 128-bit complex floats.
  2. Hint: set trans = cublas._CUBLAS_OP['T']
  3. Hint: use the Scikit-CUDA wrapper to the dot product, skcuda.cublas.cublasSdot
  4. Hint: build upon the answer to the last problem.
  5. You can put the cuBLAS operations in a CUDA stream and use event objects with this stream to precisely measure the computation times on the GPU (see the sketch after this list).
  6. Since the input appears to cuFFT as complex, it will calculate all of the values, just as NumPy's FFT does.
  7. The dark edge is due to the zero-buffering around the image. This can be mitigated by mirroring the image on its edges rather than by using a zero-buffer.
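
A hedged sketch of answer 5 (array sizes and names are arbitrary): a cuBLAS dot product is placed on a CUDA stream and timed on the GPU with a pair of event objects.

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    import pycuda.driver as drv
    from skcuda import cublas

    x = gpuarray.to_gpu(np.random.rand(10**7).astype(np.float32))
    y = gpuarray.to_gpu(np.random.rand(10**7).astype(np.float32))

    stream = drv.Stream()
    start, end = drv.Event(), drv.Event()

    handle = cublas.cublasCreate()
    cublas.cublasSetStream(handle, stream.handle)   # run the cuBLAS call on our stream

    start.record(stream)
    dot = cublas.cublasSdot(handle, x.size, x.gpudata, 1, y.gpudata, 1)
    end.record(stream)
    end.synchronize()

    print('dot product: %f' % dot)
    print('GPU time: %f ms' % start.time_till(end))
    cublas.cublasDestroy(handle)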

Chapter 8, The CUDA Device Function Libraries and Thrust

  1. Try it. (It's actually more accurate than you'd think.)
  2. One application: a Gaussian distribution can be used to add white noise to samples to augment a dataset in machine learning.
  3. No. Since they are generated from different seeds, these lists may have a strong correlation if we concatenate them together. We should use subsequences of the same seed if we plan to concatenate them (see the cuRAND sketch after this list).
  4. Try it.
  5. Hint: remember that matrix multiplication can be thought of as a series of matrix-vector multiplications, while matrix-vector multiplication can be thought of as a series of dot products.
  6. operator() is used to define the functor's actual function.
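
A hedged PyCUDA sketch of answer 3 (kernel name and seed are arbitrary): every thread initializes the cuRAND device API with the same seed but its own subsequence, which gives statistically independent streams that are safe to use together.

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    #include <curand_kernel.h>
    extern "C" {
    __global__ void gen_uniform(float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        curandState state;
        // same seed for every thread, but a distinct subsequence per thread
        curand_init(1234, i, 0, &state);
        out[i] = curand_uniform(&state);
    }
    }
    """, no_extern_c=True)

    gen_uniform = mod.get_function("gen_uniform")
    out = gpuarray.empty((1024,), dtype=np.float32)
    gen_uniform(out, block=(32, 1, 1), grid=(32, 1))
    print(out.get()[:5])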

Chapter 9, Implementation of a Deep Neural Network

  1. One problem could be that we haven't normalized our training inputs. Another could be that the training rate was too large.
  2. With a small training rate a set of weights might converge very slowly, or not at all.
  3. A large training rate can lead to a set of weights being over-fitted to particular batches or to this particular training set. Also, it can lead to numerical overflows/underflows, as in the first problem.
  4. Sigmoid.
  5. Softmax.
  6. More updates.

Chapter 10, Working with Compiled GPU Code

  1. Only the EXE file will have the host functions, but both the PTX and EXE will contain the GPU code.
  2. cuCtxDestroy (see the ctypes sketch after this list).
  3. printf with arbitrary input parameters. (Try looking up the printf prototype.)
  4. With a ctypes c_void_p object.
  5. This will allow us to link to the function with its original name from Ctypes.
  6. Device memory allocations and memcopies between device/host are automatically synchronized by CUDA.
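
A minimal ctypes sketch touching answers 2 and 4, assuming a Linux machine where the NVIDIA driver exposes libcuda.so (on some systems it is libcuda.so.1, or nvcuda.dll on Windows): device and context handles from the CUDA Driver API are held in plain c_int/c_void_p objects, and the context is cleaned up with cuCtxDestroy.

    import ctypes

    cuda = ctypes.CDLL('libcuda.so')    # the CUDA Driver API library

    cuda.cuInit(0)

    device = ctypes.c_int(0)
    cuda.cuDeviceGet(ctypes.byref(device), 0)

    context = ctypes.c_void_p()                     # CUcontext handle as a c_void_p
    cuda.cuCtxCreate(ctypes.byref(context), 0, device)

    # ... load a PTX module and launch kernels here ...

    cuda.cuCtxDestroy(context)                      # answer 2: destroy the context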

Chapter 11, Performance Optimization in CUDA

  1. The fact that atomicExch is thread-safe doesn't guarantee that all threads will execute this function at the same time; indeed, different blocks in a grid can be executed at different times.
  2. A block of size 100 will be executed over multiple warps, which will not be synchronized within the block unless we use __syncthreads. Thus, atomicExch may be called multiple times.
  3. Since a warp executes in lockstep by default, and blocks of size 32 or less are executed with a single warp, __syncthreads would be unnecessary.
  4. We use a naïve parallel sum within the warp, but otherwise, we are doing as many sums with atomicAdd as we would with a serial sum (see the atomicAdd sketch after this list). While CUDA automatically parallelizes many of these atomicAdd invocations, we could reduce the total number of required atomicAdd invocations by implementing...
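
A rough PyCUDA sketch related to answers 1 and 4 (kernel and sizes are arbitrary): every thread adds its element into a single output with atomicAdd. The atomics are thread-safe but serialize on the output, so reducing within each warp or block first would cut the number of atomicAdd calls.

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void atomic_sum(float *x, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        atomicAdd(out, x[i]);   // thread-safe, but every call serializes on *out
    }
    """)
    atomic_sum = mod.get_function("atomic_sum")

    x = gpuarray.to_gpu(np.ones(1024, dtype=np.float32))
    out = gpuarray.zeros((1,), dtype=np.float32)
    atomic_sum(x, out, block=(32, 1, 1), grid=(32, 1))
    print(out.get()[0])   # 1024.0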

Chapter 12, Where to Go from Here

  1. Two examples: DNA analysis and physics simulations.
  2. Two examples: OpenACC, Numba.
  3. TPUs are only used for machine learning operations and lack the components required to render graphics.
  4. Ethernet.