GPU-Accelerated Computing with Python 3 and CUDA
GPU-Accelerated Computing with Python 3 and CUDA: From low-level kernels to real-world applications in scientific computing and machine learning

By Niels Cautaerts and Hossein Ghorbanfekr
Paperback, Mar 2026, 534 pages, 1st Edition

1

Why GPU Programming with CUDA in Python 3?

What do blockchain and artificial intelligence (AI) have in common?

At a surface level, both technologies have, in recent years, garnered a lot of media attention and investment and formed the basis for many start-ups. But beneath these applications lies a common technological foundation: general-purpose computing on graphics processing units (GPGPU) to accelerate massively parallel computations. While the long-term impact of AI and blockchain is yet to be felt, GPGPU has already demonstrated its immense value across a multitude of fields and application areas, despite receiving significantly less public attention.

GPU programming is traditionally taught through low-level programming languages such as C or C++. This book takes a different approach and teaches GPGPU through various libraries available in Python 3. This makes the subject more accessible to our target audience: data scientists and researchers who primarily use Python and seek to accelerate computationally intensive code. This book focuses entirely on the CUDA platform, which is the most popular GPU programming framework and runs exclusively on NVIDIA hardware.

In this chapter, we will learn what GPGPU and CUDA are and how to recognize scenarios that benefit from GPGPU. We will also learn how to estimate and measure the benefits of accelerating our computations using GPGPU, and we will review its limitations, because unfortunately, it is not a magic bullet that can speed up any computation. By the end of this chapter, we will be able to do the following:

  • Understand the benefits, application areas, and limitations of GPU computing
  • Calculate the theoretical compute capacity of devices
  • Identify which types of problems benefit from massive parallelization, and estimate performance gains with Amdahl's law
  • Recognize additional factors, such as data transfers, that influence computing performance
  • Use cProfile and Scalene to discover bottlenecks in Python code


Technical requirements

To run the code in this chapter, you will need an installation of Python 3 with the numpy, matplotlib, numba, and scalene packages installed. We recommend creating a pixi environment based on the pixi.lock and pyproject.toml files in the GitHub repository associated with this book: https://github.com/PacktPublishing/GPU-Accelerated-Computing-with-Python-3-and-CUDA.

Installation instructions can be found in the README.md file.

Even if you've worked with Python for a while, you may not yet be familiar with the Pixi package manager. The reasons we chose Pixi for this book are explained in Chapter 2.

What is GPGPU?

A graphics processing unit (GPU) is a specialized processor originally designed to render images on a computer screen. This requires calculating the color of millions of pixels on the screen and updating it multiple times a second. Calculating the value of each pixel is independent and can therefore be computed in parallel. GPUs were designed to perform millions of these computations in parallel to enable fast rendering of images for applications such as computer games.

A few decades ago, researchers realized they could employ the massive parallelization power of GPUs to dramatically speed up calculations unrelated to graphics. However, writing non-rendering GPU programs used to be extremely difficult and out of reach for most programmers. That was until CUDA was released.

What is CUDA?

CUDA is a platform first released in 2007 by the company NVIDIA for making GPGPU more accessible. NVIDIA is the market leader in designing chips for GPUs. CUDA includes a high-level API, a programming model, and several libraries for running non-rendering algorithms on NVIDIA GPUs. CUDA unleashed the power of GPGPU in numerous domains by enabling programmers to easily write massively parallel algorithms.

CUDA is the main enabler of GPGPU and, by extension, the AI boom. It is what makes NVIDIA hardware the de facto standard for GPGPU. This unique market position is reflected in NVIDIA's remarkable stock price performance over the past few years:

Image 1

Figure 1.1 – NVIDIA stock price evolution over time (source data: Kaggle)

Originally, CUDA was designed to be used primarily through CUDA C with NVIDIA's NVCC compiler. As a result, CUDA programming has traditionally been taught using C or C++. Python libraries seeking to leverage GPU acceleration often had to wrap CUDA C code, which adds complexity.

In recent years, NVIDIA has invested significantly in making CUDA more accessible directly from Python, broadening its appeal to a wider audience of programmers. These efforts are centralized in the CUDA Python project (see https://nvidia.github.io/cuda-python/latest), a collection of tools and libraries that enable Python developers to write and interact with CUDA code more seamlessly.

In the first half of this book, we will focus on numba.cuda, a key component of the CUDA Python ecosystem. Numba is a just-in-time (JIT) compiler for Python, and numba.cuda is an extension that allows us to write CUDA-like code using Python syntax. This approach eliminates the need to mix multiple programming languages, enabling us to write all our code in familiar Python. Still, numba.cuda allows us to write low-level code, which will help you understand what goes on under the hood.

In the second half of the book, we will introduce high-level libraries that abstract these primitives and simplify common functionality such as array programming, data manipulation, and machine learning. In the final chapters, we will integrate these concepts and explore real-world applications, including image processing and atomistic simulations. CUDA is the foundation for nearly all the tools discussed throughout this book.

Why GPGPU is attractive

Like a CPU, a GPU contains many cores: physical processing units capable of performing arithmetic operations at extremely high speeds. Think of each core as a tiny calculator, executing simple operations rapidly. However, the key difference lies in scale and design: while a modern CPU typically features a few complex cores optimized for fast, sequential execution, a GPU may contain thousands of simpler cores specifically designed for massively parallel execution. When a problem can be divided into many smaller, independent tasks, and those tasks can be distributed across all these GPU cores, dramatic speedups can be achieved compared to traditional CPU execution.

To quantify this difference, we can calculate the theoretical maximum compute speed of a device in floating-point operations per second (FLOPS) using the following equation:

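$$\text{FLOPS} = N_{\text{cores}} \times f_{\text{clock}} \times \text{operations per cycle}$$

Here, $N_{\text{cores}}$ is the number of cores, $f_{\text{clock}}$ is the clock frequency, and the number of operations per cycle is the reciprocal of the number of clock cycles required for a single operation.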

The clock frequency is the internal metronome of the device that indicates how fast it can perform computational work, a bit like a heartbeat. It is expressed in cycles per second, or Hz. The number of clock cycles required for a single operation depends on the type of operation we want to perform (e.g., 32-bit versus 64-bit floating-point numbers) and the architecture of the device; for some operations on some devices, it is possible to perform multiple in a single cycle.

Let's use this equation to compare the theoretical 32-bit floating-point compute capacity of an Intel Core i9 CPU, which has 24 cores, with that of an NVIDIA GeForce RTX 4080 GPU with 9,728 cores; both are consumer-grade hardware. The CPU has a clock frequency of 3.5 GHz, and with the AVX instruction set, each core can perform 16 floating-point operations per clock cycle. By plugging these values into the equation, we can estimate that the CPU has a theoretical compute capacity of a little over 1 teraFLOPS (TFLOPS), or about 1.3 TFLOPS. The GPU has a lower clock frequency of 2,210 MHz, and its cores can only perform 2 operations per cycle. Still, because the GPU has so many more cores than the CPU, its theoretical compute capacity is a whopping 43 TFLOPS!

Despite the GPU's reputation as an expensive and power-hungry device, it is much more economical and energy efficient than the CPU when measured per TFLOP. At a retail price of around 1,100 USD and a power draw of up to 300 W, the RTX 4080 GPU costs about 26 USD and consumes about 7 W per TFLOP. The Intel processor retails for around 500 USD and draws up to 125 W, which at roughly 1.3 TFLOPS works out to about 370 USD and 93 W per TFLOP. The GPU is therefore roughly 14 times cheaper and 13 times more energy efficient when measured per TFLOP!
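The arithmetic in this comparison can be reproduced in a few lines of Python. The sketch below uses only the figures quoted in the text; the function name theoretical_flops is our own. Note that the CPU-to-GPU ratio depends on rounding: with the CPU rounded down to 1 TFLOP the ratio comes out near 20, while the unrounded estimate of about 1.34 TFLOPS gives a ratio near 14.

```python
def theoretical_flops(cores: int, clock_hz: float, ops_per_cycle: int) -> float:
    """Theoretical peak: cores x clock frequency x operations per cycle."""
    return cores * clock_hz * ops_per_cycle


# Hardware figures quoted in the text (32-bit floating point).
cpu_tflops = theoretical_flops(cores=24, clock_hz=3.5e9, ops_per_cycle=16) / 1e12
gpu_tflops = theoretical_flops(cores=9728, clock_hz=2.21e9, ops_per_cycle=2) / 1e12

# Retail price (USD) and maximum power draw (W), also taken from the text.
cpu_usd_per_tflop = 500 / cpu_tflops
gpu_usd_per_tflop = 1100 / gpu_tflops
cpu_watt_per_tflop = 125 / cpu_tflops
gpu_watt_per_tflop = 300 / gpu_tflops

print(f"CPU: {cpu_tflops:.2f} TFLOPS, "
      f"{cpu_usd_per_tflop:.0f} USD/TFLOP, {cpu_watt_per_tflop:.0f} W/TFLOP")
print(f"GPU: {gpu_tflops:.2f} TFLOPS, "
      f"{gpu_usd_per_tflop:.0f} USD/TFLOP, {gpu_watt_per_tflop:.0f} W/TFLOP")
print(f"Cost ratio:   {cpu_usd_per_tflop / gpu_usd_per_tflop:.1f}x")
print(f"Energy ratio: {cpu_watt_per_tflop / gpu_watt_per_tflop:.1f}x")
```

These are theoretical peaks only; real workloads rarely reach them, as the rest of the chapter explains.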

When is GPGPU useful?

The calculation of theoretical compute capacity illustrates that the biggest factor in GPU compute power is the massive number of cores. CPUs generally outperform GPUs in terms of clock frequency and number of operations per clock cycle. This means that if we only use a single GPU core, the calculation will be slower than on the CPU. Only code that can make simultaneous use of a large number of cores will benefit from GPGPU.

GPU cores are not interchangeable with CPU cores. Whereas CPU cores operate more or less independently from each other, GPU cores are grouped into larger units called streaming multiprocessors (SMs). Cores in the same SM do not operate independently and execute instructions in a coordinated manner; we will explore the details in Chapter 5. The consequence is that a GPU is best suited for one specific type of parallelism: data parallelism.

Work can be parallelized in two fundamental ways: task parallelism and data parallelism. Task parallelism involves executing distinct and independent tasks concurrently across different workers (e.g., CPU cores). Data parallelism, on the other hand, divides a single task into identical operations executed simultaneously across multiple data elements, such as applying the same operation to every element in an array.
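The distinction can be made concrete with a small CPU-only sketch using the standard library. The functions load_config, count_words, and square are hypothetical stand-ins, and a thread pool is used purely to show the structure of the two patterns; because of CPython's global interpreter lock, it does not demonstrate an actual speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def load_config() -> str:
    """Hypothetical task A."""
    return "config"


def count_words(text: str) -> int:
    """Hypothetical task B, unrelated to task A."""
    return len(text.split())


def square(x: int) -> int:
    """One operation that is applied to every data element."""
    return x * x


with ThreadPoolExecutor(max_workers=4) as pool:
    # Task parallelism: distinct, independent functions run concurrently.
    future_a = pool.submit(load_config)
    future_b = pool.submit(count_words, "the same pool runs distinct tasks")
    task_results = (future_a.result(), future_b.result())

    # Data parallelism: the same function is applied to many elements.
    data_results = list(pool.map(square, range(8)))

print(task_results)
print(data_results)
```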

GPUs are optimized for data parallelism because their architecture allows thousands of threads (sequences of instructions) to execute the same instruction in lockstep, maximizing throughput for uniform workloads such as matrix operations or image processing. We will see in Chapter 3 that the data parallelism model is almost baked into the CUDA programming model.

GPUs struggle with task parallelism and divergent execution paths. When threads within a warp (a group of 32 GPU threads) follow different control flows (such as taking different branches in an if-else statement), performance degrades significantly due to a phenomenon called warp divergence. We will explain warps and warp divergence in more detail in Chapter 5.

 

Task parallelism on the GPU

There are strategies for employing task parallelism on the GPU using techniques such as warp specialization: assigning different tasks to different warps. However, this can result in very complex code and is reserved for advanced use cases.

Application areas of GPGPU

In the previous section, we mentioned that the GPU excels at accelerating computations by leveraging data parallelism. It turns out that many problems from diverse fields benefit from this approach. In recent years, deep learning, AI, and blockchain have received significant attention, but other application areas include the following:

  • Scientific computation and simulations in fields such as quantum mechanics, fluid dynamics, and astrophysics
  • Bioinformatics, analysis of genetic data, and predictions of protein structures (e.g., AlphaFold)
  • Signal processing: image, video, and audio
  • Computer vision and object detection
  • Medical imaging and CT reconstruction
  • Computational finance
  • Operations research

We aim to illuminate some of these application areas through example cases in this book.

Estimating the benefits of parallelization

While writing code on the GPU is certainly easier than it once was, writing CUDA code is still not easy. Converting a straightforward serial implementation of an algorithm into a massively parallel version that runs on the GPU takes development time. Therefore, it is important to be able to estimate up front whether the speedups we can gain at runtime are worth the effort.

 

A very naive approach to calculate speedup would be to divide the theoretical compute capacity of the GPU by the compute capacity of a single CPU core. In our example from the previous section, we should expect a speedup of 43 TFLOPS / 0.056 TFLOPS ≈ 770× for 32-bit floating-point operations when translating an algorithm from a serial implementation that runs on a single core of the Intel Core i9 CPU to a parallelized implementation that runs on the RTX 4080 GPU. Impressive! Unfortunately, this is very unrealistic.

Not all work is parallelizable; there will always be some serial operations in any code. The amount of code that cannot be parallelized has a massive impact on the maximum speedup. This can be illustrated with Amdahl's law, which we will now discuss.

Using Amdahl's law for speedup estimation

Suppose we have some code that requires a time, t, to run if we do not use any parallelization. Let's assume our code can be split into two portions: a parallelizable part and a non-parallelizable part. The fraction of code that is parallelizable we will denote as p, which means that 1-p is the fraction of non-parallelizable code. Therefore, in the serial implementation, the time taken by the parallelizable part is pt and (1-p)t for the non-parallelizable part.

Now, suppose we add compute resources that speed up the parallelizable part of the code by a factor, s. The time to run the non-parallelizable code is not affected. This means that the total runtime with parallelization, t_s, becomes the following:

$$t_s = (1-p)\,t + \frac{p\,t}{s}$$

The speedup we gain from adding compute resources is as follows:

$$\frac{t}{t_s} = \frac{1}{(1-p) + \frac{p}{s}}$$

This equation is known as Amdahl's law. It can give us a much more sobering estimate of the speedups we can expect. Figure 1.2 shows a plot of expected speedup, t/t_s, versus the speedup factor of the parallel part, s. Note that the x axis has a logarithmic scale. The different lines represent different values of p.

Image 2

Figure 1.2 – Amdahl's law showing the total speedup versus the speedup of the parallel part for different parallelizable fractions

All curves, no matter what p is, plateau out, which means that adding more workers to speed up the parallel part of the code has diminishing returns. The reason we always reach a plateau is that as s becomes large, the fraction p/s in the denominator becomes small. In the limit as s goes to infinity, the maximum speedup we can reach is 1/(1-p). When only 50% of the code is parallelizable (p=0.5), the maximum speedup we can ever reach by adding an infinite number of workers is 2. Even when 99% of our code is parallelizable (p=0.99), we eventually reach a plateau near 100.

If we bring this back to our GPU/CPU example, the most optimistic estimate for s is 770. We can either draw a vertical line on the plot at 770 or plug s=770 into the equation to estimate more realistic speedups if we know p. The takeaway is that in many cases, the most important factor in determining our speedup is the fraction of code that cannot be parallelized, as it sets our speedup ceiling. Depending on this ceiling, we can decide whether implementing a GPU version of the code is worthwhile at all.
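As a minimal sketch of this estimate, the function below evaluates Amdahl's law, 1 / ((1 - p) + p / s), for the optimistic s=770 and a few values of p. The function name amdahl_speedup and the chosen values of p are ours, picked for illustration.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Total speedup when a fraction p of the runtime is sped up by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)


s = 770  # optimistic GPU versus single-CPU-core factor from the text

for p in (0.5, 0.9, 0.99, 0.999):
    ceiling = 1.0 / (1.0 - p)  # limit of the speedup as s goes to infinity
    print(f"p={p}: speedup at s={s} is {amdahl_speedup(p, s):.1f} "
          f"(ceiling {ceiling:.0f})")
```

Even with p=0.99, the estimated speedup at s=770 stays below the ceiling of 100, far from the naive factor of 770.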

How can we estimate p? First, we need to have a good understanding of our algorithms so that we know which parts can be parallelized. Second, we need to know the time taken by each part of our algorithm. We can figure this out using code profiling.

Profiling code for generating a Julia set fractal

Let's work through an example. We will calculate an image of a Julia set fractal. Julia sets are sets of complex numbers defined by a recursive relation. When plotted as an image, they take the shape of a beautiful fractal.

To define the Julia set, we first define the infinite recursive sequence $z_{n+1} = z_n^2 + c$ for complex numbers $z_n$, $z_{n+1}$, and c and natural number n. The Julia set is defined as the set of all starting values $z_0$ for which the modulus $|z_n|$ does not diverge to infinity as n goes to infinity. In practice, as soon as any term satisfies $|z_n| > R$, with R a cut-off radius, we know that the modulus diverges. c is a constant we choose, and it defines how the Julia set looks. We can generate beautiful fractal images by coloring all z in the complex plane based on the number of iterations until the sequence diverges.

For the not-so-mathematically inclined reader

Don't worry if the mathematical details of the Julia set are too complicated; it doesn't really matter. Just think of the Julia set as a map where each point (pixel) is colored based on how quickly a simple mathematical rule causes it to 'escape' to infinity. Points that escape slowly form the intricate, beautiful patterns of a fractal. We use the Julia set example because it can be easily parallelized and produces nice images.

A Python function that generates an image of a Julia set can be implemented rather naively as follows:

import numpy as np
from numpy.typing import NDArray

def calculate_julia(
    z_0: NDArray[np.complex64],
    c: complex,
    max_iter: int,
) -> NDArray[np.int32]:
    julia_set = np.zeros(z_0.shape, dtype=np.int32)
    R = 0.5 + 0.5 * np.sqrt(1 + 4 * abs(c))

    for i in range(z_0.shape[0]):
        for j in range(z_0.shape[1]):
            z = z_0[i, j]
            for iteration in range(1, max_iter + 1):
                if abs(z) > R:
                    julia_set[i, j] = iteration
                    break
                z = z ** 2 + c
    return julia_set

The arguments we supply to the function are the 2D input array of complex numbers, z_0, the complex constant, c, and the maximum number of times we will repeat the iteration to check whether the number is part of the set.

In the function, first, we allocate a zeroed julia_set output image, the size of which is determined by our input array. Then, the cut-off radius, R, is calculated from c according to a known formula.

In a double loop, we iterate over all elements of z_0. In the inner loop, z is repeatedly updated using the aforementioned recursive expression, a maximum of max_iter times. When |z| > R, we know the sequence diverges and z_0 is not part of the Julia set; thus, we set the corresponding element in the output array to the number of iterations until divergence and break out of the loop. If we do not break out of the loop, then the value of that coordinate in julia_set remains 0, and we assume the number is part of the Julia set after a maximum number of iterations.

Before we can run the calculation, we must first create our 2D z_0 array as follows:

a = np.linspace(-1.7, 1.7, 1200, dtype=np.float32)
b = np.linspace(-1.7, 1.7, 1200, dtype=np.float32)
A, B = np.meshgrid(a, b)
z_0 = A + 1j * B

We can then calculate the Julia set for a chosen c=-0.8+0.156j, as follows:

julia_set = calculate_julia(
    z_0,
    c=-0.8+0.156j,
    max_iter=3000,
)

We can visualize julia_set by mapping each value in the array to a color, for example, using the imshow function in matplotlib:

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6, 6))
im = ax.imshow(julia_set, cmap="magma",
               vmax=0.25 * np.max(julia_set))

We specify vmax to increase the contrast. Then, we can save the image to a file using the following:

fig.savefig("path/to/file")

The result is this intricate fractal image:

Image 3

Figure 1.3 – Julia set with c=-0.8+0.156j

Calculating this fractal image takes a bit of time (on this machine, it took about 14.5 seconds). This is because we calculate each pixel sequentially. However, each pixel is independent, so we could potentially calculate all of them in parallel.

Creating our z_0 input array and allocating memory for the julia_set output array are sequential operations, and so are plotting and saving our image to disk. We can time all these different parts using time.perf_counter to estimate how much our code would benefit from parallelization. To do this, we can wrap the relevant pieces of code as follows:

import time
start = time.perf_counter()
# code we want to time here
end = time.perf_counter()
print(end - start)
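When several sections need timing, repeating this pattern gets tedious. One possible alternative, sketched here with only the standard library, is a small context manager that stores each measurement under a label. The name timed and the stand-in workloads are hypothetical; in the chapter's example, the enclosed blocks would be the creation of z_0, the call to calculate_julia, and the plotting code.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, results: dict[str, float]):
    """Store the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings: dict[str, float] = {}

with timed("setup", timings):
    data = [i * 0.5 for i in range(100_000)]  # stand-in for creating z_0

with timed("compute", timings):
    total = sum(x * x for x in data)  # stand-in for calculate_julia

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```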

On the machine where these code examples were run, creating z_0 takes about 0.005 seconds, and plotting and saving the result takes 0.1 seconds. Calculating the Julia set takes about 13.8 seconds. The time to allocate a zeroed julia_set array is negligible. Therefore, if we assume that calculate_julia is fully parallelizable and the rest is fully serial, then p = 13.8/(13.8+0.105) = 99.2%. That means, according to Amdahl's law, by adding an infinite number of compute resources, we could only reach a speedup of 1/(1-0.992)=125×. All these numbers will vary depending on the specific hardware on which the examples are run.
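The same estimate can be scripted so that it is easy to repeat with timings from a different machine. The sketch below plugs in the timings quoted above; the variable names are ours. Note that the ceiling is sensitive to rounding: rounding p to 0.992 gives 125×, while the unrounded value of p gives a ceiling of roughly 132×.

```python
# Timings in seconds, as quoted in the text for the authors' machine.
t_setup = 0.005    # creating z_0
t_output = 0.1     # plotting and saving the image
t_compute = 13.8   # calculate_julia, assumed fully parallelizable

t_serial = t_setup + t_output
p = t_compute / (t_compute + t_serial)

ceiling_exact = 1.0 / (1.0 - p)               # unrounded p
ceiling_rounded = 1.0 / (1.0 - round(p, 3))   # p rounded to 0.992

print(f"p = {p:.4f}")
print(f"Speedup ceiling with unrounded p: {ceiling_exact:.0f}x")
print(f"Speedup ceiling with p = 0.992:   {ceiling_rounded:.0f}x")
```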

Performing more advanced profiling with cProfile and Scalene

In this simple example, we could manually identify the interesting pieces of code to profile. As our applications get more complex, this approach does not scale. Therefore, it is important to be aware of profilers such as cProfile and Scalene.

cProfile is a profiler that is part of the Python standard library. If our Python code is in a script, we can profile the script using the command line as follows:

python -m cProfile -s cumtime script.py

We use the -s flag to sort by cumtime, which is the total time spent in the function, including sub-functions. When we profile only the code that calculates the Julia set and exclude the code that plots and saves the result, we get a table that looks something like this (rows are split over two lines to fit on the page):

34137406 function calls (34135254 primitive calls) in 21.307
seconds
   Ordered by: cumulative time
   ncalls  tottime  percall  cumtime  percall \
filename:lineno(function)
     96/1    0.000    0.000   21.307   21.307 \
{built-in method builtins.exec}
        1    0.003    0.003   21.307   21.307 \
cprofile.py:1(<module>)
        1   16.710   16.710   21.168   21.168 \
cprofile.py:5(calculate_julia)
 34040305    4.459    0.000    4.459    0.000 \
{built-in method builtins.abs}
    ...

The table shows how often each function was called (ncalls), how much time was spent in each function over all calls (tottime and cumtime), and how much time was spent per call of each function (percall). The difference between tottime and cumtime is that tottime excludes the time spent in sub-functions, whereas cumtime includes it. For example, cProfile measures the built-in abs function, which is used inside the calculate_julia function. According to cProfile, evaluating |z| takes about 4.5 seconds, or roughly 20% of the total runtime. The tottime of calculate_julia (16.7 seconds) is its cumtime (21.2 seconds) minus the time taken by abs (4.5 seconds).
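cProfile can also be driven from within Python, which is handy for profiling a single call rather than a whole script. The sketch below uses the standard library's cProfile.Profile and pstats.Stats; the functions inner and outer are hypothetical stand-ins that mimic the structure of calculate_julia calling abs many times.

```python
import cProfile
import io
import pstats


def inner(x: float) -> float:
    """Small function called many times, like abs in the Julia example."""
    return abs(x) * 2.0


def outer(n: int) -> float:
    """Its cumtime includes the time spent in inner and abs."""
    total = 0.0
    for i in range(n):
        total += inner(i - n / 2)
    return total


profiler = cProfile.Profile()
profiler.runcall(outer, 50_000)  # profile just this call

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)  # top five rows by cumtime
report = buffer.getvalue()
print(report)
```

In the resulting table, the tottime of outer should be noticeably smaller than its cumtime, because most of the work happens in the sub-functions.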

Notice how running the code with the profiler is substantially slower than running it without the profiler. This is because profiling adds a small overhead to each function call, and this example makes tens of millions of small function calls, so the accumulated overhead is large. With the profiler, this code takes about 50% more time than without (21 seconds versus 14 seconds), which means that these results are not very reliable in an absolute sense; a manual timing of specific code sections will give more accurate results. Still, cProfile can give us quick insight into the bottlenecks that exist in our code. It is convenient and always available in the Python standard library.

Using cProfile in Jupyter Notebooks

When working in a Jupyter notebook, we can use the %%prun magic command at the start of a cell to profile that cell with cProfile. However, the Jupyter and IPython processes and the interactive backends of matplotlib can interfere with the profiler and skew the results. We highly suggest using cProfile outside Jupyter notebooks and not profiling matplotlib code with it.

cProfile is limited to timing function calls, and the results are not always reliable, as our example shows. We want to get more insight into the bottlenecks of our calculate_julia function, but this is not possible with cProfile unless we split our function into more useful subfunctions.

In this case, a line profiler is more appropriate. With line profilers, individual lines of Python are timed. Scalene is a powerful profiler that can do line profiling. In addition, it performs memory profiling, GPU profiling, and measures how much time is spent in Python versus "native" (i.e., compiled) code. It also claims to add a minimal amount of runtime overhead, thereby providing reliable timings. It can also be easily used from Jupyter notebooks. It is a third-party package that can be installed with any Python package manager, such as pip, conda, or pixi.

To use Scalene from a Jupyter notebook, first load the extension:

%load_ext scalene

Then, we can profile code by adding the following to the top of the cell that needs to be profiled:

%%scalene

Alternatively, we can put our code in a .py script file and run it with scalene my_script.py. Note that in both cases, for line profiling to be useful, we need all the relevant code in the same script or Jupyter notebook cell. We will get output that looks something like this:

Image 4

Figure 1.4 – Line profiling results obtained with Scalene

The leftmost column shows the percentage of time spent by the Python interpreter. The results show that almost 80% of the total runtime is spent on the line that updates z in the calculate_julia function. We could not derive this from the results of cProfile, as this line is not a function call. We also see that a good chunk of time is spent on evaluating the abs function, which is in line with the cProfile results.

Using this type of information, we can identify bottlenecks in our code, which should be the first target for optimization.

Understanding the factors that limit parallelism

In the previous section, we made an important omission about Amdahl's law: the fraction of parallelizable code is not a constant but depends on the size of our inputs. If we calculate a smaller image and use fewer iterations, the time to calculate the Julia set decreases, and so does the fraction of parallelizable code. The larger our problem, the larger the fraction of parallelizable code, and the more we will benefit from parallelization. Therefore, we need to understand not only the code but also the size of the problem it will be applied to.

Secondly, Amdahl's law assumes we can infinitely speed up the parallel part by adding more resources. However, in our Julia set example, it is clear that there is no benefit to adding more workers than there are pixels. The number of tasks is a discrete number, and so is the number of workers. The best we can do is to have one worker per task, in which case the runtime is reduced to that of the slowest-running task. If some workers get more work than others, some workers will be idle as they wait for busy workers to catch up. This is relevant in our Julia set example, since some pixels escape after just one iteration, while others require the full max_iter iterations to complete.
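The effect of uneven work can be illustrated with a toy model. The sketch below assumes that tasks are split into contiguous, equally sized chunks, one per worker, and that the runtime equals the load of the busiest worker; real schedulers may distribute work differently. The task costs are made-up numbers, not measurements.

```python
def makespan(task_times: list[float], n_workers: int) -> float:
    """Runtime when tasks are split into contiguous, equally sized chunks."""
    chunk = -(-len(task_times) // n_workers)  # ceiling division
    loads = [
        sum(task_times[i:i + chunk])
        for i in range(0, len(task_times), chunk)
    ]
    return max(loads)  # everyone waits for the busiest worker


# Hypothetical per-pixel costs: most pixels escape quickly, a few run long.
task_times = [1.0] * 12 + [50.0] * 4
serial_time = sum(task_times)

for n_workers in (1, 2, 4, 8, 16):
    runtime = makespan(task_times, n_workers)
    speedup = serial_time / runtime
    efficiency = speedup / n_workers
    print(f"{n_workers:2d} workers: runtime {runtime:6.1f}, "
          f"speedup {speedup:5.2f}, efficiency {efficiency:4.2f}")
```

In this model, four workers barely help, because all expensive tasks land on the same worker, and even with one worker per task the runtime cannot drop below the cost of the slowest task.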

Third, Amdahl's law only applies to embarrassingly parallel problems, for which tasks have no interdependencies. The expected speedup of the parallel part is assumed to be linearly related to the compute capacity. This is the case for our Julia set example: each pixel can be calculated independently. However, for other problems, tasks may have interdependencies, in which case they need to wait for each other or communicate with each other. Communication and synchronization add overhead, which reduces the effectiveness of parallelization.

Finally, Amdahl's law assumes the bottleneck in running parallelizable code is the compute capacity. However, many problems are memory-bound (i.e., the bottleneck is the speed at which cores can get the data they need). The speed of these data transfers is governed by two key concepts: latency and bandwidth. Latency refers to the delay between requesting data and receiving it; imagine waiting for a single package to arrive by mail. Bandwidth, on the other hand, is the maximum rate at which data can be transferred, such as the number of packages a delivery truck can carry in an hour. For small amounts of data, latency dominates the total transfer time, while for large amounts of data, bandwidth becomes the limiting factor. If workers are waiting for data, then adding more workers will not speed anything up. Memory bottlenecks are so important to many problems that all modern processors, including both CPUs and GPUs, incorporate multiple layers of cache to optimize data locality, keeping frequently accessed data close to the compute units. When we start to optimize code, it is essential to figure out whether our problem is compute-bound or memory-bound.
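A simple first-order model captures the interplay of latency and bandwidth: transfer time is approximately latency plus size divided by bandwidth. The sketch below evaluates this model for three transfer sizes. The latency and bandwidth values are hypothetical round numbers chosen for illustration, not the specifications of any particular device.

```python
def transfer_time(n_bytes: float, latency_s: float, bandwidth_bps: float) -> float:
    """First-order model: fixed latency plus size divided by bandwidth."""
    return latency_s + n_bytes / bandwidth_bps


LATENCY_S = 10e-6      # 10 microseconds (hypothetical)
BANDWIDTH_BPS = 16e9   # 16 GB/s (hypothetical)

latency_shares = {}
for n_bytes in (1e3, 1e6, 1e9):
    total = transfer_time(n_bytes, LATENCY_S, BANDWIDTH_BPS)
    latency_shares[n_bytes] = LATENCY_S / total
    print(f"{n_bytes:>13,.0f} bytes: {total * 1e6:12.2f} microseconds, "
          f"latency share {latency_shares[n_bytes]:.1%}")
```

Under these assumptions, a 1 kB transfer is almost entirely latency, while a 1 GB transfer is almost entirely limited by bandwidth.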

Despite its limitations, Amdahl's law is a useful rule of thumb that can help us estimate an upper bound on the expected benefit of adding more workers. However, due to the aforementioned reasons, such as memory bottlenecks and synchronization, performance rarely scales linearly with compute capacity. Therefore, the only way to measure the benefits of parallelization is to run the code with different levels of parallelism.

Measuring the benefits of parallelization empirically

Let us measure the effect of adding workers for our Julia set example using numba, and illustrate why Amdahl's law is still too optimistic.

As mentioned earlier in the chapter, numba is a JIT compiler for Python. In the next few chapters, we will focus on the numba.cuda plugin, which allows us to use Numba to write CUDA code. Here, we will compile Python code to machine code for execution on the CPU and parallelize it over CPU cores.

To compile a Python function with numba, we use the njit decorator on the function as follows:

from numba import njit, prange
@njit
def calculate_julia(...):
    ...

When we invoke the function for the first time, numba will figure out the types we pass into the function and compile the function to a fast machine code version specialized to those types. Compiling a function takes a bit of time, especially with complex functions. However, compiled versions of the function are cached, so executing the function a second time will be very fast. Whenever we pass different types into the function, numba has to compile another version of the function.

njit versus jit

The jit decorator can also be used to compile a function. In Numba versions before 0.59, the difference with njit was that jit allowed "partial" compilation: if Numba could not compile all parts of a function, it fell back on the Python interpreter for those bits of code, which has a large negative impact on performance. With njit, which stands for no-python JIT, we disallow this behavior and force Numba to raise an error when it cannot compile the entire function. Since Numba 0.59, jit also defaults to no-python mode, so the two decorators behave the same unless the fallback is requested explicitly.

numba supports auto-parallelization of loops using prange. When we replace range with prange in a loop, we tell numba that it can distribute different iterations of that loop among multiple workers, in this case, CPU cores. Remember, no GPU is involved at this point. If our CPU has only a single core, there will be no parallelization. To enable parallelization, we must pass the parallel=True argument to the @njit decorator; otherwise, prange behaves identically to range. In our implementation of calculate_julia, only the range of the outer loop was replaced with prange. We could also put prange in the second loop, but profiling showed that this has no effect on performance. Hence, the JITted calculate_julia function looks as follows:

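import numpy as np
from numba import njit, prange
from numpy.typing import NDArray


@njit(parallel=True)
def calculate_julia_jit(
    z_0: NDArray[np.complex64],
    c: complex,
    max_iter: int,
) -> NDArray[np.int32]:
    # Iteration at which each point escapes; 0 means it never escaped.
    julia_set = np.zeros(z_0.shape, dtype=np.int32)
    # Escape radius: once abs(z) exceeds R, the sequence keeps growing.
    R = 0.5 + 0.5 * np.sqrt(1 + 4 * abs(c))
    # Rows are independent, so prange may distribute them over CPU threads.
    for i in prange(z_0.shape[0]):
        for j in range(z_0.shape[1]):
            z = z_0[i, j]
            # Sequential loop: each value of z depends on the previous one.
            for iteration in range(1, max_iter + 1):
                if abs(z) > R:
                    julia_set[i, j] = iteration
                    break
                z = z ** 2 + c
    return julia_set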

This code was run and timed using one, two, four, and eight threads, set with numba.set_num_threads, on a CPU with eight cores. The results are plotted in Figure 1.5, which shows execution time, speedup (the ratio of the execution time with one worker to the execution time with n workers), and efficiency (the ratio of speedup to the number of workers):

Figure 1.5 – Time, speedup, and efficiency of calculate_julia as a function of the number of workers

The measured times are compared to what would be expected based on ideal scaling (i.e., doubling the number of workers halves the execution time). The results show the diminishing returns of adding workers. Whereas going from one to two cores very closely follows ideal scaling (efficiency is close to one), adding any more workers barely improves the runtime.
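Speedup and efficiency, as defined above, are straightforward to compute from a set of timings. The sketch below uses made-up placeholder timings that only mimic the qualitative trend; they are not the measurements behind Figure 1.5, and the variable names are hypothetical:

```python
# Placeholder timings in seconds, keyed by number of workers (NOT measured values).
timings = {1: 8.0, 2: 4.0, 4: 3.2, 8: 3.2}

baseline = timings[1]
speedup = {n: baseline / t for n, t in timings.items()}
efficiency = {n: speedup[n] / n for n in timings}
ideal_time = {n: baseline / n for n in timings}  # doubling workers halves the time

for n in timings:
    print(
        f"{n} workers: time {timings[n]:.2f} s (ideal {ideal_time[n]:.2f} s), "
        f"speedup {speedup[n]:.2f}, efficiency {efficiency[n]:.2f}"
    )
```

An efficiency close to one means the workers are being used well; a falling efficiency signals that additional workers are mostly waiting.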

These timings only take into consideration the calculate_julia_jit function, which we considered to be almost 100% parallelizable. Therefore, the ideal scaling lines should closely reflect what Amdahl's law predicts. Our measurements show that even Amdahl's law is far too optimistic, and many more factors are at play, as mentioned in the previous section.

JIT compilation performance

Compiling the function with numba.njit drastically reduced the runtime from around 14 seconds to around 0.35 seconds when using a single CPU thread. This impressive performance gain can be achieved because Numba circumvents the Python interpreter entirely and compiles the code to machine code with near-C performance.

This does not mean that there is no point in using the GPU, where we could leverage thousands of threads/cores. However, we cannot simply extrapolate the CPU graph to the GPU, since CPU and GPU implementations cannot be compared directly: CPU and GPU threads are different, the flow of data is different, and the execution models are different.

To illustrate, we ran a GPU implementation of the calculate_julia algorithm on an A100 GPU, which has over 6,000 CUDA cores. We will learn how to write GPU implementations of algorithms in Chapter 3; in this chapter, the goal is only to illustrate a point on performance. Just like in the CPU implementation, we ran it multiple times on the GPU with different numbers of threads, starting with one and going up to about four million. The execution times versus the number of threads are shown in Figure 1.6 on a log-log graph. For comparison, the CPU execution times, which we measured earlier, are also plotted, in addition to the times that would be expected given ideal scaling. One arrow indicates the number of CUDA cores available on the device; the other arrow points to the number of array elements that need to be calculated, which is equivalent to the number of tasks that can be run in parallel.

Figure 1.6 – Comparison of execution time between CPU and GPU implementations for different numbers of threads

The figure highlights that a CPU thread cannot be compared to a GPU thread. With a small number of threads, the GPU is heavily underutilized, and the performance is almost 100× worse compared to one CPU thread. Only when we launch about 500 GPU threads is the performance comparable to one CPU thread. Performance peaks when the number of threads is approximately equal to the number of parallelizable tasks (roughly 1.5M). In this case, we achieve a performance that is 20× better than 8 CPU threads and almost 50× better than one CPU thread. This is good, but almost an order of magnitude removed from what we might expect based on the difference in theoretical compute capacity: the A100 GPU has a compute capacity of 19.5 TFLOPS for 32-bit floats, whereas one CPU core has a compute capacity of roughly 0.05 TFLOPS; a difference of almost 400×.

When we run the calculation with more threads than tasks, execution time increases slightly compared to the minimum. While GPU threads incur only a small overhead, launching threads that have nothing to do obviously decreases performance.

Typically, we do not think of the GPU as a collection of cores, but as a single powerful processor that we aim to fully occupy. A rule of thumb and a good starting point to fully occupy the GPU is to launch a thread for each parallelizable task. The GPU is best equipped to schedule these threads efficiently on the available cores.

This analysis omits a lot of important details, since CUDA does not allow us to simply select the number of threads we want to use to execute some code. CUDA kernel execution involves additional concepts like grids and blocks. We will learn all about that in Chapter 3.
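As a small preview of why the thread count cannot be chosen freely, the sketch below assumes that threads are launched in equally sized blocks, so the total number of threads is a multiple of the block size. The task count and the block size of 256 are hypothetical values chosen for illustration:

```python
import math

n_tasks = 1_500_000        # hypothetical number of array elements to compute
threads_per_block = 256    # assumed block size, for illustration only

# Round up so that every task gets a thread.
n_blocks = math.ceil(n_tasks / threads_per_block)
n_threads = n_blocks * threads_per_block
idle_threads = n_threads - n_tasks

print(f"{n_blocks} blocks of {threads_per_block} threads = {n_threads} threads")
print(f"{idle_threads} threads have no task assigned")
```

Rounding up leaves a small number of threads without work, which is far cheaper than launching many times more threads than there are tasks.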

The main point of this section is to demonstrate that the most reliable way to figure out how much parallelization benefits our code is to measure its effects. Even code that we believe to be "fully parallelizable" may have additional bottlenecks that limit the effectiveness of additional workers.

Summary

In this chapter, we discussed what GPGPU is, what the motivation behind it is, and what the application areas are. We explained that GPU computing is valuable for problems that can be solved with data parallelism, i.e., dividing the problem into small chunks and running the same task on all those chunks. A GPU has a high theoretical compute capacity due to its massive number of cores. If a problem cannot be split up and distributed over multiple cores, the GPU will not speed up the computation.

After considering theoretical compute capacity, we refined our estimates of code speedup using Amdahl's law and showed that the fraction of non-parallelizable code eventually dominates. We illustrated this with an example of calculating a Julia set fractal. We profiled the code and related the results back to Amdahl's law. We also discussed the limitations of Amdahl's law and other factors that limit parallelism. We demonstrated the effect of these factors on the performance of the Julia set example by measuring parallelization efficiency. In the process, we briefly learned about Numba, JIT compilation, and parallelization over CPU cores. We also compared our CPU implementation against a GPU implementation, which showed that we cannot simply compare a CPU thread and a GPU thread.

Even though we have not learned how to program a GPU, we are now well equipped to estimate, measure, and recognize the limitations of using a GPU for speeding up computations. In the next chapter, we will learn how to set up an environment that will allow us to write and execute CUDA code.

Questions

  1. If we have a problem that is 90% parallelizable, what is the maximum speedup we can achieve? How many compute resources would be needed to achieve 90% of this limit? How many compute resources are needed to reach 99% of this limit?
  2. For the first problem, we only achieve 50% of the theoretical maximum speedup. What could be the reasons?
  3. Can we parallelize the third (inner) loop in the Julia set function? Why or why not?
  4. How can numba JIT compilation speed up Python code by a factor of >100?
  5. Can we do line profiling on JIT-compiled functions?
  6. Why does the execution time of the GPU code decrease even when there are more threads than CUDA cores? Why is this typically not the case for CPU threads?

Answers

  1. The non-parallelizable part of the code is 10% or 0.1. Therefore, according to Amdahl's law, the maximum speedup is 10×. If we want to reach 90% of that, we want to achieve a speedup of 9×. If we plug this into Amdahl's law again and solve for s, we need to speed up the parallelizable part 81×, so if we assume linear scaling, we need 81× the amount of compute resources. To reach 99% of the maximum speedup, we want to reach a speedup of 9.9×. This corresponds to s=891; in other words, we need 891× the amount of compute resources. In reality, we would need even more resources to account for imperfect scaling.
  2. Memory bottlenecks, coordination/synchronization/communication between workers, and workload imbalance between workers.
  3. No. The loop serves to sequentially update a single variable. If we distribute this over multiple threads, we will get completely non-deterministic behavior as different threads compete to try to update z.
  4. The key is that, while the Numba-decorated function looks like a Python function, it no longer is a Python function. Numba compiles the function to machine code, which is about as efficient as compiled C or Fortran.
  5. No. Because the JITted function is no longer running on the Python interpreter, it is no longer being run line by line, and line profiling becomes meaningless.
  6. GPU threads are very lightweight, and the GPU is very good at efficiently scheduling those threads and switching between them. When there is more work to be done than there are threads, we are trying to do the job of the scheduler, because we are instructing our threads to do multiple tasks that could be parallelized. CPU threads are heavyweight, and typically, there is no benefit to adding more threads than the number of cores.
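
The arithmetic behind answer 1 can be written out step by step. The derivation below assumes Amdahl's law in the form shown on its first line, where p (a symbol introduced here) is the parallelizable fraction and s is the speedup of the parallelizable part:

```latex
\begin{align*}
S(s) &= \frac{1}{(1 - p) + \dfrac{p}{s}}, \qquad p = 0.9 \\
S_{\max} &= \lim_{s \to \infty} S(s) = \frac{1}{1 - p} = 10 \\
S = 9:\quad & 0.1 + \frac{0.9}{s} = \frac{1}{9}
  \;\Rightarrow\; \frac{0.9}{s} = \frac{1}{90}
  \;\Rightarrow\; s = 81 \\
S = 9.9:\quad & 0.1 + \frac{0.9}{s} = \frac{1}{9.9}
  \;\Rightarrow\; \frac{0.9}{s} = \frac{1}{990}
  \;\Rightarrow\; s = 891
\end{align*}
```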


Key benefits

  • Build a solid foundation in CUDA with Python, from kernel design to execution and debugging
  • Optimize GPU performance with efficient memory access, CUDA streams, and multi-GPU scaling
  • Use JAX, CuPy, RAPIDS, and Numba to accelerate numerical computing and machine learning
  • Create practical GPU applications, from PDE solvers to image processing and transformers

Description

Writing high-performance Python code doesn’t have to mean switching to C++. This book shows you how to accelerate Python applications using NVIDIA’s CUDA platform and a modern ecosystem of Python tools and libraries. Aimed at researchers, engineers, and data scientists, it offers a practical yet deep understanding of GPU programming and how to fully exploit modern GPU hardware. You’ll begin with the fundamentals of CUDA programming in Python using Numba-CUDA, learning how GPUs work and how to write, execute, and debug custom GPU kernels. Building on this foundation, the book explores memory access optimization, asynchronous execution with CUDA streams, and multi-GPU scaling using Dask-CUDA. Performance analysis and tuning are emphasized throughout, using NVIDIA Nsight profilers. You’ll also learn to use high-level GPU libraries such as JAX, CuPy, and RAPIDS to accelerate numerical Python workflows with minimal code changes. These techniques are applied to real-world examples, including PDE solvers, image processing, physical simulations, and transformer models. Written by experienced GPU practitioners, this hands-on guide emphasizes reproducible workflows using Python 3.10+, CUDA 12.3+, and tools like the Pixi package manager. By the end, you’ll have future-ready skills for building scalable GPU applications in Python.

Who is this book for?

Python developers, (data) scientists, engineers, and researchers looking to accelerate numerical computations without switching to low-level languages. This book is ideal for those with experience in scientific Python (NumPy, Pandas, SciPy) and a basic understanding of computing fundamentals who want deeper control over performance in GPU environments.

What you will learn

  • Understand GPU execution, parallelism, and the CUDA programming model
  • Write, launch, and debug custom CUDA kernels in Python with CUDA
  • Profile GPU code with NVIDIA Nsight and optimize memory access
  • Use CUDA streams and async execution to overlap compute and transfers
  • Apply JAX, CuPy, and RAPIDS to numerical computing and machine learning
  • Scale GPU workloads across devices using Dask and multi-GPU strategies
  • Accelerate PDE solvers, simulations, and image processing on the GPU
  • Build, train, and run a transformer model from scratch on the GPU

Product Details

Publication date: Mar 31, 2026
Length: 534 pages
Edition: 1st
Language: English
ISBN-13: 9781803245423

Table of Contents

23 Chapters
Part 1: Fundamentals of GPU programming with CUDA in Python 3
Chapter 1: Why GPU Programming with CUDA in Python 3?
Chapter 2: Setting Up a GPU Programming Environment Locally and in the Cloud
Chapter 3: Writing and Executing CUDA Kernels with Numba-CUDA
Chapter 4: Profiling and Debugging CUDA Code
Part 2: Performance Optimization and Advanced CUDA Topics
Chapter 5: Optimizing the Performance of CUDA Code
Chapter 6: Enabling Concurrency Using CUDA Streams
Chapter 7: Scaling to Multiple GPUs
Part 3: Using High-Level Python Libraries for GPU Computation
Chapter 8: Bringing NumPy and SciPy to the GPU with CuPy
Chapter 9: Bringing pandas and scikit-learn to the GPU with Rapids
Chapter 10: Solving Optimization Problems on the GPU with JAX
Part 4: Real-World Example Applications
Chapter 11: Solving the Heat Equation on the GPU
Chapter 12: Image Processing and Computer Vision on the GPU
Chapter 13: Simulating Atomic Interactions on the GPU
Chapter 14: Implementing Your Own Transformer-Based Language Model
Part 5: Beyond This Book
Chapter 15: Expanding and Deepening Your GPU Programming Knowledge
Chapter 16: Unlock Your Exclusive Benefits
Other Books You May Enjoy
Index
