GPU-Accelerated Computing with Python 3 and CUDA

1

Why GPU Programming with CUDA in Python 3?

What do blockchain and artificial intelligence (AI) have in common?

At a surface level, both technologies have, in recent years, garnered a lot of media attention and investment and formed the basis for many start-ups. But beneath these applications lies a common technological foundation: general-purpose computing on graphics processing units (GPGPU) to accelerate massively parallel computations. While the long-term impact of AI and blockchain is yet to be felt, GPGPU has already demonstrated its immense value across a multitude of fields and application areas, despite receiving significantly less public attention.

GPU programming is traditionally taught through low-level programming languages such as C or C++. This book takes a different approach and teaches GPGPU through various libraries available in Python 3. This makes the subject more accessible to our target audience: data scientists and researchers who primarily use Python and seek to accelerate computationally intensive code. This book focuses entirely on the CUDA platform, the most popular GPU programming framework, which runs exclusively on NVIDIA hardware.

In this chapter, we will learn what GPGPU and CUDA are and how to recognize scenarios that benefit from GPGPU. We will also learn how to estimate and measure the benefits of accelerating our computations using GPGPU. Finally, we will review the limitations of GPGPU, because, unfortunately, it is not a magic bullet that can speed up any computation. By the end of this chapter, you will be able to do the following:

Understand the benefits, application areas, and limitations of GPU computing
Calculate the theoretical compute capacity of devices
Identify which types of problems benefit from massive parallelization, and estimate performance gains with Amdahl's law
Recognize additional factors, such as data transfers, that influence computing performance
Use cProfile and Scalene to discover bottlenecks in Python code

What is GPGPU?

A graphics processing unit (GPU) is a specialized processor originally designed to render images on a computer screen. This requires calculating the color of millions of pixels on the screen and updating it multiple times a second. Calculating the value of each pixel is independent and can therefore be computed in parallel. GPUs were designed to perform millions of these computations in parallel to enable fast rendering of images for applications such as computer games.

A few decades ago, researchers realized they could employ the massive parallelization power of GPUs to dramatically speed up calculations unrelated to graphics. However, writing non-rendering GPU programs used to be extremely difficult and out of reach for most programmers. That was until CUDA was released.

What is CUDA?

CUDA is a platform first released in 2007 by the company NVIDIA for making GPGPU more accessible. NVIDIA is the market leader in designing chips for GPUs. CUDA includes a high-level API, a programming model, and several libraries for running non-rendering algorithms on NVIDIA GPUs. CUDA unleashed the power of GPGPU in numerous domains by enabling programmers to easily write massively parallel algorithms.

CUDA is the main enabler of GPGPU and, by extension, the AI boom. It is what makes NVIDIA hardware the de facto standard for GPGPU. This unique market position is reflected in NVIDIA's remarkable stock price performance over the past few years:

Figure 1.1 – NVIDIA stock price evolution over time (source data: Kaggle)

Originally, CUDA was designed to be used primarily through CUDA C with NVIDIA's NVCC compiler. As a result, CUDA programming has traditionally been taught using C or C++. Python libraries seeking to leverage GPU acceleration often had to wrap CUDA C code, which adds complexity.

In recent years, NVIDIA has invested significantly in making CUDA more accessible directly from Python, broadening its appeal to a wider audience of programmers. These efforts are centralized in the CUDA Python project (see https://nvidia.github.io/cuda-python/latest), a collection of tools and libraries that enable Python developers to write and interact with CUDA code more seamlessly.

In the first half of this book, we will focus on numba.cuda, a key component of the CUDA Python ecosystem. Numba is a just-in-time (JIT) compiler for Python, and numba.cuda is an extension that allows us to write CUDA-like code using Python syntax. This approach eliminates the need to mix multiple programming languages, enabling us to write all our code in familiar Python. Still, numba.cuda allows us to write low-level code, which will help you understand what goes on under the hood.

In the second half of the book, we will introduce high-level libraries that abstract these primitives and simplify common functionality such as array programming, data manipulation, and machine learning. In the final chapters, we will integrate these concepts and explore real-world applications, including image processing and atomistic simulations. CUDA is the foundation for nearly all the tools discussed throughout this book.

Why GPGPU is attractive

Like a CPU, a GPU contains many cores: physical processing units capable of performing arithmetic operations at extremely high speeds. Think of each core as a tiny calculator, executing simple operations rapidly. However, the key difference lies in scale and design: while a modern CPU typically features a few complex cores optimized for fast, sequential execution, a GPU may contain thousands of simpler cores specifically designed for massively parallel execution. When a problem can be divided into many smaller, independent tasks, and those tasks can be distributed across all these GPU cores, dramatic speedups can be achieved compared to traditional CPU execution.

To quantify this difference, we can calculate the theoretical maximum compute speed of a device in floating-point operations per second (FLOPS) using the following equation:

$$\text{FLOPS} = \text{number of cores} \times \text{clock frequency} \times \text{operations per clock cycle}$$

The clock frequency is the internal metronome of the device that indicates how fast it can perform computational work, a bit like a heartbeat. It is expressed in cycles per second, or Hz. The number of operations per clock cycle depends on the type of operation we want to perform (e.g., 32-bit versus 64-bit floating-point arithmetic) and the architecture of the device; on some devices, a core can perform multiple operations in a single cycle.

Let's use this equation to compare the theoretical compute capacity for 32-bit floating-point numbers of an Intel Core i9 CPU, which has 24 cores, to that of an NVIDIA GeForce RTX 4080 GPU with 9,728 cores; both are consumer-grade hardware. The CPU has a clock frequency of 3.5 GHz, and with the AVX instruction set, each core can perform 16 floating-point operations per clock cycle. Plugging these values into the equation gives a theoretical compute capacity of about 1.3 teraFLOPS (TFLOPS) for the CPU. The GPU has a lower clock frequency of 2.21 GHz, and its cores can only perform 2 operations per cycle. Still, because the GPU has so many more cores than the CPU, its theoretical compute capacity is a whopping 43 TFLOPS!
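
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes the peak compute capacity is the product of core count, clock frequency, and operations per clock cycle; the function name theoretical_flops is our own and not part of any library.

```python
def theoretical_flops(cores: int, clock_hz: float, ops_per_cycle: int) -> float:
    """Peak floating-point operations per second (assumed simple product model)."""
    return cores * clock_hz * ops_per_cycle


# Figures quoted in the text for 32-bit floating-point operations.
cpu_flops = theoretical_flops(cores=24, clock_hz=3.5e9, ops_per_cycle=16)
gpu_flops = theoretical_flops(cores=9728, clock_hz=2.21e9, ops_per_cycle=2)

print(f"CPU: {cpu_flops / 1e12:.2f} TFLOPS")
print(f"GPU: {gpu_flops / 1e12:.2f} TFLOPS")
print(f"GPU/CPU ratio: {gpu_flops / cpu_flops:.0f}x")
```

These are theoretical peaks only; real workloads rarely come close to them.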

Despite the GPU's reputation as an expensive and power-hungry device, it is much more economical and energy efficient than the CPU when measured per TFLOPS. At a retail price of around 1,100 USD and a power draw of up to 300 W, the RTX 4080 GPU costs about 26 USD and consumes about 7 W per TFLOPS. The Intel processor retails for around 500 USD and draws up to 125 W, which works out to roughly 370 USD and 93 W per TFLOPS; the GPU is therefore more than ten times cheaper and more energy efficient when measured per TFLOPS!

When is GPGPU useful?

The calculation of theoretical compute capacity illustrates that the biggest factor in GPU compute power is the massive number of cores. CPUs generally outperform GPUs in terms of clock frequency and number of operations per clock cycle. This means that if we only use a single GPU core, the calculation will be slower than on the CPU. Only code that can make simultaneous use of a large number of cores will benefit from GPGPU.

GPU cores are not interchangeable with CPU cores. Whereas CPU cores operate more or less independently from each other, GPU cores are grouped into larger units called streaming multiprocessors (SMs). Cores in the same SM do not operate independently and execute instructions in a coordinated manner; we will explore the details in Chapter 5. The consequence is that a GPU is best suited for one specific type of parallelism: data parallelism.

Work can be parallelized in two fundamental ways: task parallelism and data parallelism. Task parallelism involves executing distinct and independent tasks concurrently across different workers (e.g., CPU cores). Data parallelism, on the other hand, divides a single task into identical operations executed simultaneously across multiple data elements, such as applying the same operation to every element in an array.
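
The structural difference between the two forms of parallelism can be sketched on the CPU with Python's standard library. The functions load_config, count_words, and square are hypothetical stand-ins. Note that CPython threads do not run CPU-bound code truly in parallel, so this only illustrates how the work is organized, not a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def load_config() -> str:
    """Hypothetical task A."""
    return "config loaded"


def count_words(text: str) -> int:
    """Hypothetical task B, unrelated to task A."""
    return len(text.split())


def square(x: int) -> int:
    """One identical operation applied to each data element."""
    return x * x


with ThreadPoolExecutor(max_workers=4) as pool:
    # Task parallelism: distinct, independent tasks run concurrently.
    future_a = pool.submit(load_config)
    future_b = pool.submit(count_words, "the quick brown fox")
    task_results = (future_a.result(), future_b.result())

    # Data parallelism: the same operation is mapped over many elements.
    data_results = list(pool.map(square, range(8)))

print(task_results)
print(data_results)
```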

GPUs are optimized for data parallelism because their architecture allows thousands of threads (sequences of instructions) to execute the same instruction in lockstep, maximizing throughput for uniform workloads such as matrix operations or image processing. We will see in Chapter 3 that the data parallelism model is almost baked into the CUDA programming model.

GPUs struggle with task parallelism and divergent execution paths. When threads within a warp (a group of 32 GPU threads) follow different control flows (such as taking different branches in an if-else statement), performance degrades significantly due to a phenomenon called warp divergence. We will explain warps and warp divergence in more detail in Chapter 5.

Task parallelism on the GPU

There are strategies for employing task parallelism on the GPU using techniques such as warp specialization: assigning different tasks to different warps. However, this can result in very complex code and is reserved for advanced use cases.

Application areas of GPGPU

In the previous section, we mentioned that the GPU excels at accelerating computations by leveraging data parallelism. It turns out that many problems from diverse fields benefit from this approach. In recent years, deep learning, AI, and blockchain have received significant attention, but other application areas include the following:

Scientific computation and simulations in fields such as quantum mechanics, fluid dynamics, and astrophysics
Bioinformatics, analysis of genetic data, and predictions of protein structures (e.g., AlphaFold)
Signal processing: image, video, and audio
Computer vision and object detection
Medical imaging and CT reconstruction
Computational finance
Operations research

We aim to illuminate some of these application areas through example cases in this book.

Estimating the benefits of parallelization

While writing code on the GPU is certainly easier than it once was, writing CUDA code is still not easy. Converting a straightforward serial implementation of an algorithm into a massively parallel version that runs on the GPU takes development time. Therefore, it is important to be able to estimate up front whether the speedups we can gain at runtime are worth the effort.

A very naive approach to calculate speedup would be to divide the theoretical compute capacity of the GPU by the compute capacity of a single CPU core. In our example from the previous section, we should expect a speedup of 43 TFLOPS / 0.056 TFLOPS ≈ 770× for 32-bit floating-point operations when translating an algorithm from a serial implementation that runs on a single core of the Intel Core i9 CPU to a parallelized implementation that runs on the RTX 4080 GPU. Impressive! Unfortunately, this is very unrealistic.

Not all work is parallelizable; there will always be some serial operations in any code. The amount of code that cannot be parallelized has a massive impact on the maximum speedup. This can be illustrated with Amdahl's law, which we will now discuss.

Using Amdahl's law for speedup estimation

Suppose we have some code that requires a time, t, to run if we do not use any parallelization. Let's assume our code can be split into two portions: a parallelizable part and a non-parallelizable part. The fraction of code that is parallelizable we will denote as p, which means that 1-p is the fraction of non-parallelizable code. Therefore, in the serial implementation, the time taken by the parallelizable part is pt and (1-p)t for the non-parallelizable part.

Now, suppose we add compute resources that speed up the parallelizable part of the code by a factor, s. The time to run the non-parallelizable code is not affected. This means that the total runtime with parallelization t_s becomes the following:

$$t_s = (1 - p)\,t + \frac{p\,t}{s}$$

The speedup we gain from adding compute resources is as follows:

$$\frac{t}{t_s} = \frac{1}{(1 - p) + \frac{p}{s}}$$

This equation is known as Amdahl's law. It can give us a much more sobering estimate of the speedups we can expect. Figure 1.2 shows a plot of expected speedup, t/t_s, versus the speedup factor of the parallel part, s. Note that the x axis has a logarithmic scale. The different lines represent different values of p.

Figure 1.2 – Amdahl's law showing the total speedup versus the speedup of the parallel part for different parallelizable fractions

All curves, regardless of p, eventually plateau, which means that adding more workers to speed up the parallel part of the code has diminishing returns. The reason we always reach a plateau is that as s becomes large, the fraction p/s in the denominator becomes small. In the limit as s goes to infinity, the maximum speedup we can reach is 1/(1-p). When only 50% of the code is parallelizable (p=0.5), the maximum speedup we can ever reach by adding an infinite number of workers is 2. Even when 99% of our code is parallelizable (p=0.99), we eventually reach a plateau near 100.

If we bring this back to our GPU/CPU example, the most optimistic estimate for s is 770. We can either draw a vertical line on the plot at 770 or plug s=770 into the equation to estimate more realistic speedups if we know p. The takeaway is that in many cases, the most important factor in determining our speedup is the fraction of code that cannot be parallelized, as it sets our speedup ceiling. Depending on this ceiling, we can decide whether implementing a GPU version of the code is worthwhile at all.
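
To make this concrete, Amdahl's law, speedup = 1 / ((1 - p) + p / s), can be evaluated for a few values of p at s=770. This is a minimal sketch; the function names amdahl_speedup and amdahl_ceiling are our own.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Total speedup t / t_s when a fraction p of the runtime is sped up by s."""
    return 1.0 / ((1.0 - p) + p / s)


def amdahl_ceiling(p: float) -> float:
    """Limit of the total speedup as s goes to infinity (requires p < 1)."""
    return 1.0 / (1.0 - p)


s = 770  # optimistic speedup of the parallel part, taken from the text
for p in (0.5, 0.9, 0.99):
    print(
        f"p={p}: speedup at s={s} is {amdahl_speedup(p, s):.1f}, "
        f"ceiling is {amdahl_ceiling(p):.0f}"
    )
```

Even with p=0.99, the total speedup at s=770 stays below the ceiling of 100, far from the naive estimate of 770.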

How can we estimate p? First, we need to have a good understanding of our algorithms so that we know which parts can be parallelized. Second, we need to know the time taken by each part of our algorithm. We can figure this out using code profiling.

Profiling code for generating a Julia set fractal

Let's work through an example. We will calculate an image of a Julia set fractal. Julia sets are sets of complex numbers defined by a recursive relation. When plotted as an image, they take the shape of a beautiful fractal.

To define the Julia set, we first define an infinite recursive sequence of complex numbers z_n, for a complex constant c and natural number n:

$$z_{n+1} = z_n^2 + c$$

The Julia set is the set of all starting values z_0 for which the modulus |z_n| does not diverge to infinity as n goes to infinity. In practice, as soon as any term satisfies |z_n| > R, where R is a cut-off radius, we know that the modulus diverges. c is a constant we choose, and it defines how the Julia set looks. We can generate beautiful fractal images by coloring all z in the complex plane based on the number of iterations until the sequence diverges.

For the not-so-mathematically inclined reader

Don't worry if the mathematical details of the Julia set are too complicated; it doesn't really matter. Just think of the Julia set as a map where each point (pixel) is colored based on how quickly a simple mathematical rule causes it to 'escape' to infinity. Points that escape slowly form the intricate, beautiful patterns of a fractal. We use the Julia set example because it can be easily parallelized and produces nice images.

A Python function that generates an image of a Julia set can be implemented rather naively as follows:

import numpy as np
from numpy.typing import NDArray

def calculate_julia(
    z_0: NDArray[np.complex64],
    c: complex,
    max_iter: int,
) -> NDArray[np.int32]:
    julia_set = np.zeros(z_0.shape, dtype=np.int32)
    R = 0.5 + 0.5 * np.sqrt(1 + 4 * abs(c))

    for i in range(z_0.shape[0]):
        for j in range(z_0.shape[1]):
            z = z_0[i, j]
            for iteration in range(1, max_iter + 1):
                if abs(z) > R:
                    julia_set[i, j] = iteration
                    break
                z = z ** 2 + c
    return julia_set

The arguments we supply to the function are the 2D input array of complex numbers, z_0, the complex constant, c, and the maximum number of times we will repeat the iteration to check whether the number is part of the set.

In the function, first, we allocate a zeroed julia_set output image, the size of which is determined by our input array. Then, the cut-off radius, R, is calculated from c according to a known formula.

In a double loop, we iterate over all elements of z_0. In the inner loop, z is repeatedly updated using the aforementioned recursive expression, a maximum of max_iter times. When |z| > R, we know the sequence diverges and z_0 is not part of the Julia set; thus, we set the corresponding element in the output array to the number of iterations until divergence and break out of the loop. If we do not break out of the loop, then the value of that coordinate in julia_set remains 0, and we assume the number is part of the Julia set after a maximum number of iterations.

Before we can run the calculation, we must first create our 2D z_0 array as follows:

a = np.linspace(-1.7, 1.7, 1200, dtype=np.float32)
b = np.linspace(-1.7, 1.7, 1200, dtype=np.float32)
A, B = np.meshgrid(a, b)
z_0 = A + 1j * B

We can then calculate the Julia set for a chosen c=-0.8+0.156j, as follows:

julia_set = calculate_julia(
    z_0,
    c=-0.8+0.156j,
    max_iter=3000,
)

We can visualize julia_set by mapping each value in the array to a color, for example, using the imshow function in matplotlib:

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6, 6))
im = ax.imshow(julia_set, cmap="magma",
               vmax = 0.25 * np.max(julia_set))

We specify vmax to increase the contrast. Then, we can save the image to a file using the following:

fig.savefig("path/to/file")

The result is this intricate fractal image:

Figure 1.3 – Julia set with c=-0.8+0.156j

Calculating this fractal image takes a bit of time (on this machine, it took about 14.5 seconds). This is because we calculate each pixel sequentially. However, each pixel is independent, so we could potentially calculate all of them in parallel.

Creating our z_0 input array and allocating memory for the julia_set output array are sequential operations, and so are plotting and saving our image to disk. We can time all these different parts using time.perf_counter to estimate how much our code would benefit from parallelization. To do this, we can wrap the relevant pieces of code as follows:

import time
start = time.perf_counter()
# code we want to time here
end = time.perf_counter()
print(end - start)

On the machine where these code examples were run, creating z_0 takes about 0.005 seconds, and plotting and saving the result takes 0.1 seconds. Calculating the Julia set takes about 13.8 seconds. The time to allocate a zeroed julia_set array is negligible. Therefore, if we assume that calculate_julia is fully parallelizable and the rest is fully serial, then p = 13.8/(13.8+0.105) = 99.2%. That means, according to Amdahl's law, by adding an infinite number of compute resources, we could only reach a speedup of 1/(1-0.992)=125×. All these numbers will vary depending on the specific hardware on which the examples are run.
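
The estimate of p can be reproduced from the timings quoted above. This sketch assumes, as the text does, that calculate_julia is fully parallelizable and everything else is fully serial; the dictionary layout is our own choice.

```python
# Timings in seconds quoted in the text, with a flag for "parallelizable".
timings = {
    "create z_0": (0.005, False),
    "calculate_julia": (13.8, True),
    "plot and save": (0.1, False),
}

total = sum(seconds for seconds, _ in timings.values())
parallel = sum(seconds for seconds, is_parallel in timings.values() if is_parallel)

p = parallel / total
max_speedup = 1.0 / (1.0 - p)  # Amdahl ceiling as s goes to infinity

print(f"p = {p:.4f}")
print(f"speedup ceiling = {max_speedup:.0f}x")
```

Without rounding p to 0.992 first, the ceiling comes out slightly higher than 125×, which shows how sensitive the ceiling is to small changes in p when p is close to 1.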

Performing more advanced profiling with cProfile and Scalene

In this simple example, we could manually identify the interesting pieces of code to profile. As our applications get more complex, this approach does not scale. Therefore, it is important to be aware of profilers such as cProfile and Scalene.

cProfile is a profiler that is part of the Python standard library. If our Python code is in a script, we can profile the script using the command line as follows:

python -m cProfile -s cumtime script.py

We use the -s flag to sort by cumtime, which is the total time spent in the function, including sub-functions. When we profile only the code that calculates the Julia set and exclude the code that plots and saves the result, we get a table that looks something like this (rows are split over two lines to fit on the page):

34137406 function calls (34135254 primitive calls) in 21.307
seconds
   Ordered by: cumulative time
   ncalls  tottime  percall  cumtime  percall \
filename:lineno(function)
     96/1    0.000    0.000   21.307   21.307 \
{built-in method builtins.exec}
        1    0.003    0.003   21.307   21.307 \
cprofile.py:1(<module>)
        1   16.710   16.710   21.168   21.168 \
cprofile.py:5(calculate_julia)
 34040305    4.459    0.000    4.459    0.000 \
{built-in method builtins.abs}
    ...

The table shows how often each function was called (ncalls), the total time spent in each function over all calls (tottime and cumtime), and the time spent per call (percall, listed once for tottime and once for cumtime). The difference between tottime and cumtime is that tottime excludes the time spent in sub-functions. For example, cProfile measures the built-in abs function, which is called inside the calculate_julia function. According to cProfile, evaluating |z| takes about 4.5 seconds, or roughly 20% of the total runtime. The tottime of calculate_julia (16.7 seconds) is therefore its cumtime (21.2 seconds) minus the time spent in abs.

Notice how running the code with the profiler is substantially slower than running it without the profiler. This is because profiling adds a tiny bit of overhead to each function call, and in this example, there are a lot of small function calls, which makes the total overhead very large. With the profiler, this code takes roughly 50% more time than without (21 seconds versus 14 seconds), which means that these results are not very reliable in an absolute sense; a manual timing of specific code sections will give more accurate results. Still, cProfile can give us quick insight into the bottlenecks that exist in our code. It is convenient and always available in the Python standard library.
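
cProfile can also be driven from within Python rather than from the command line, which makes it possible to profile only a selected region of code. The following is a minimal sketch using the standard cProfile and pstats modules; busy_loop is a hypothetical stand-in workload, not the Julia set code.

```python
import cProfile
import io
import pstats


def busy_loop(n: int) -> float:
    """Hypothetical workload that makes many small abs() calls."""
    total = 0.0
    for i in range(n):
        total += abs(i - n / 2)
    return total


profiler = cProfile.Profile()
profiler.enable()      # start collecting statistics
busy_loop(200_000)
profiler.disable()     # stop collecting statistics

# Sort by cumulative time and keep only the five most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

As with the command-line invocation, the many small abs calls inflate the measured runtime, so the numbers are best read as relative indications.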

Using cProfile in Jupyter Notebooks

When working in a Jupyter notebook, we can use the %%prun magic command at the start of a cell to profile that cell with cProfile. However, the Jupyter and IPython processes, as well as matplotlib's interactive backends, can interfere with the profiler and skew the results. We highly suggest using cProfile outside Jupyter notebooks and not profiling matplotlib code with it.

cProfile is limited to timing function calls, and the results are not always reliable, as our example shows. We want to get more insight into the bottlenecks of our calculate_julia function, but this is not possible with cProfile unless we split our function into more useful subfunctions.

In this case, a line profiler is more appropriate. With line profilers, individual lines of Python are timed. Scalene is a powerful profiler that can do line profiling. In addition, it performs memory profiling, GPU profiling, and measures how much time is spent in Python versus "native" (i.e., compiled) code. It also claims to add a minimal amount of runtime overhead, thereby providing reliable timings. It can also be easily used from Jupyter notebooks. It is a third-party package that can be installed with any Python package manager, such as pip, conda, or pixi.

To use Scalene from a Jupyter notebook, first load the extension:

%load_ext scalene

Then, we can profile code by adding the following to the top of the cell that needs to be profiled:

%%scalene

Alternatively, we can put our code in a .py script file and run it with scalene my_script.py. Note that in both cases, for line profiling to be useful, we need all the relevant code in the same script or Jupyter notebook cell. We will get output that looks something like this:

Figure 1.4 – Line profiling results obtained with Scalene

The leftmost column shows the percentage of time spent by the Python interpreter. The results show that almost 80% of the time of all our code is spent on the line that updates z in the calculate_julia function. We could not derive this from the cProfile results, as this line is not a function call. We also see that a good chunk of time is spent on evaluating the abs function, which is in line with the cProfile results.

Using this type of information, we can identify bottlenecks in our code, which should be the first target for optimization.

Understanding the factors that limit parallelism

In the previous section, we made an important omission about Amdahl's law: the fraction of parallelizable code is not a constant but depends on the size of our inputs. If we calculate a smaller image and use fewer iterations, the time to calculate the Julia set decreases, and so does the fraction of parallelizable code. The larger our problem, the larger the fraction of parallelizable code and the more we will benefit from parallelization. Therefore, we don't only need to understand the code; we also have to know what the size of the problem will be.

Second, Amdahl's law assumes we can infinitely speed up the parallel part by adding more resources. However, in our Julia set example, it is clear that there is no benefit to adding more workers than there are pixels. The number of tasks is a discrete number, and so is the number of workers. The best we can do is to have one worker per task, in which case the runtime is reduced to that of the slowest-running task. If some workers get more work than others, some workers will be idle as they wait for busy workers to catch up. This is relevant in our Julia set example, since some pixels escape after just one iteration, while others require the full max_iter iterations to complete.
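
The effect of uneven work can be illustrated with a small model. This sketch assumes tasks are split into contiguous chunks, one chunk per worker, and that the runtime equals the load of the busiest worker; the task costs are hypothetical iteration counts, not measurements.

```python
def parallel_runtime(task_costs: list[int], n_workers: int) -> int:
    """Runtime when tasks are split into contiguous chunks, one per worker."""
    chunk = -(-len(task_costs) // n_workers)  # ceiling division
    loads = [
        sum(task_costs[i:i + chunk])
        for i in range(0, len(task_costs), chunk)
    ]
    return max(loads)  # everyone waits for the busiest worker


# Hypothetical per-pixel costs: most pixels escape at once, two run to max_iter.
task_costs = [1, 1, 1, 1, 1, 1, 3000, 3000]
serial_runtime = sum(task_costs)

for n_workers in (1, 2, 4, 8):
    runtime = parallel_runtime(task_costs, n_workers)
    print(f"{n_workers} workers: runtime={runtime}, "
          f"speedup={serial_runtime / runtime:.2f}")
```

In this toy model, even eight workers for eight tasks only yield a speedup of about 2, because the two expensive pixels dominate the runtime.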

Third, Amdahl's law only applies to embarrassingly parallel problems, for which tasks have no interdependencies. The expected speedup of the parallel part is assumed to be linearly related to the compute capacity. This is the case for our Julia set example: each pixel can be calculated independently. However, for other problems, tasks may have interdependencies, in which case they need to wait for each other or communicate with each other. Communication and synchronization add overhead, which reduces the effectiveness of parallelization.

Finally, Amdahl's law assumes the bottleneck in running parallelizable code is the compute capacity. However, many problems are memory-bound (i.e., the bottleneck is the speed at which cores can get the data they need). The speed of these data transfers is governed by two key concepts: latency and bandwidth. Latency refers to the delay between requesting data and receiving it; imagine waiting for a single package to arrive by mail. Bandwidth, on the other hand, is the maximum rate at which data can be transferred, such as the number of packages a delivery truck can carry in an hour. For small amounts of data, latency dominates the total transfer time, while for large amounts of data, bandwidth becomes the limiting factor. If workers are waiting for data, then adding more workers will not speed anything up. Memory bottlenecks are so important to many problems that all modern processors, including both CPUs and GPUs, incorporate multiple layers of cache to optimize data locality, keeping frequently accessed data close to the compute units. When we start to optimize code, it is essential to figure out whether our problem is compute-bound or memory-bound.
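
The interplay between latency and bandwidth can be sketched with a simple linear model, transfer time = latency + size / bandwidth. The latency and bandwidth values below are hypothetical round numbers chosen for illustration, not properties of any particular device.

```python
def transfer_time(n_bytes: float, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Simple linear model of the time needed to move n_bytes."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


LATENCY = 10e-6    # hypothetical: 10 microseconds per transfer
BANDWIDTH = 10e9   # hypothetical: 10 GB/s

latency_share = {}
for n_bytes in (1e3, 1e6, 1e9):
    t = transfer_time(n_bytes, LATENCY, BANDWIDTH)
    latency_share[n_bytes] = LATENCY / t
    print(f"{n_bytes:.0e} bytes: {t * 1e6:.1f} us, "
          f"latency share {latency_share[n_bytes]:.1%}")
```

Under these assumptions, latency accounts for almost all of the time for the smallest transfer and is negligible for the largest one.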

Despite its limitations, Amdahl's law is a useful rule of thumb that can help us estimate an upper bound on the expected benefit of adding more workers. However, due to the aforementioned reasons, such as memory bottlenecks and synchronization, performance rarely scales linearly with compute capacity. Therefore, the only way to measure the benefits of parallelization is to run the code with different levels of parallelism.

Measuring the benefits of parallelization empirically

Let us measure the effect of adding workers for our Julia set example using numba, and illustrate why Amdahl's law is still too optimistic.

As mentioned earlier in the chapter, numba is a JIT compiler for Python. In the next few chapters, we will focus on the numba.cuda plugin, which allows us to use Numba to write CUDA code. Here, we will compile Python code to machine code for execution on the CPU and parallelize it over CPU cores.

To compile a Python function with numba, we use the njit decorator on the function as follows:

from numba import njit, prange
@njit
def calculate_julia(...):
    ...

When we invoke the function for the first time, numba will figure out the types we pass into the function and compile the function to a fast machine code version specialized to those types. Compiling a function takes a bit of time, especially with complex functions. However, compiled versions of the function are cached, so executing the function a second time will be very fast. Whenever we pass different types into the function, numba has to compile another version of the function.

njit versus jit

The jit decorator can also be used for compiling a function. Historically, the difference with njit was that jit allowed "partial" compilation: if Numba could not compile all parts of a function, it fell back on the Python interpreter for those bits of code, which has a large negative impact on performance. With njit, which stands for no-python JIT, we disallow this behavior and force Numba to throw an error when it cannot compile the entire function. Recent Numba releases (0.59 and later) make this no-python mode the default for jit as well.

numba supports auto-parallelization of loops using prange. When we replace range with prange in a loop, we tell numba that it can distribute different iterations of that loop among multiple workers, in this case, CPU cores. Remember, no GPU is involved at this point. If our CPU has only a single core, there will be no parallelization. To enable parallelization, we must pass the parallel=True argument to the @njit decorator; otherwise, prange behaves identically to range. In our implementation of calculate_julia, only the range of the outer loop was replaced with prange. We could also put prange in the second loop, but profiling showed that this has no effect on performance. Hence, the JITted calculate_julia function looks as follows:

@njit(parallel=True)
def calculate_julia_jit(
    z_0: NDArray[np.complex64],
    c: complex,
    max_iter: int,
) -> NDArray[np.int32]:
    julia_set = np.zeros(z_0.shape, dtype=np.int32)
    R = 0.5 + 0.5 * np.sqrt(1 + 4 * abs(c))
    for i in prange(z_0.shape[0]):
        for j in range(z_0.shape[1]):
            z = z_0[i, j]
            for iteration in range(1, max_iter + 1):
                if abs(z) > R:
                    julia_set[i, j] = iteration
                    break
                z = z ** 2 + c
    return julia_set

This code was run and timed using one, two, four, and eight threads, using numba.set_num_threads, on a CPU with eight cores. The results are plotted in Figure 1.5, which shows execution time, speedup (ratio of execution time with one worker and the execution time with n workers), and efficiency (ratio of speedup and the number of workers):

Figure 1.5 – Time, speedup, and efficiency of calculate_julia as a function of the number of workers

The measured times are compared to what would be expected based on ideal scaling (i.e., doubling the number of workers halves the execution time). The results show the diminishing returns of adding workers. Whereas going from one to two cores very closely follows ideal scaling (efficiency is close to one), adding any more workers barely improves the runtime.
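The speedup and efficiency definitions used for Figure 1.5 can be sketched in a few lines of plain Python. The helper name scaling_metrics is hypothetical, and the input below is not the measured data behind the figure: it is an ideal-scaling series derived from the single-thread time of about 0.35 seconds, used only as a sanity check. With real measurements, the efficiency values would drop below one as workers are added.

```python
def scaling_metrics(times: dict[int, float]) -> dict[int, tuple[float, float]]:
    """Map each worker count to (speedup, efficiency).

    `times` maps a worker count to an execution time in seconds and
    must contain an entry for one worker, which serves as the baseline.
    """
    t_1 = times[1]
    metrics = {}
    for n_workers, t_n in sorted(times.items()):
        speedup = t_1 / t_n               # time with 1 worker / time with n workers
        efficiency = speedup / n_workers  # speedup per worker
        metrics[n_workers] = (speedup, efficiency)
    return metrics


# Ideal scaling (illustrative, NOT measured): doubling workers halves the time.
ideal_times = {n: 0.35 / n for n in (1, 2, 4, 8)}

for n, (speedup, efficiency) in scaling_metrics(ideal_times).items():
    print(f"{n} workers: speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```

Replacing ideal_times with our own timings gives the numbers needed to reproduce a plot like Figure 1.5.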

These timings only take into consideration the calculate_julia_jit function, which we considered to be almost 100% parallelizable. Therefore, the ideal scaling lines should closely reflect what Amdahl's law predicts. Our measurements show that even Amdahl's law is far too optimistic, and many more factors are at play, as mentioned in the previous section.

JIT compilation performance

Compiling the function with numba.njit drastically reduced the runtime from around 14 seconds to around 0.35 seconds when using a single CPU thread. This impressive performance gain can be achieved because Numba circumvents the Python interpreter entirely and compiles the code to machine code with near-C performance.

The diminishing returns we observed on the CPU do not mean that there is no point in using the GPU, where we could leverage thousands of threads and cores. However, we cannot simply extrapolate the CPU graph to the GPU, because CPU and GPU implementations are not directly comparable: the threads are different, the flow of data is different, and the execution models are different.

To illustrate, we ran a GPU implementation of the calculate_julia algorithm on an A100 GPU, which has over 6,000 CUDA cores. We will learn how to write GPU implementations of algorithms in Chapter 3; in this chapter, the goal is only to illustrate a point on performance. Just like in the CPU implementation, we ran it multiple times on the GPU with different numbers of threads, starting with one and going up to about four million. The execution times versus the number of threads are shown in Figure 1.6 on a log-log graph. For comparison, the CPU execution times, which we measured earlier, are also plotted, in addition to the times that would be expected given ideal scaling. One arrow indicates the number of CUDA cores available on the device; the other arrow points to the number of array elements that need to be calculated, which is equivalent to the number of tasks that can be run in parallel.

Figure 1.6 – Comparison of execution time between CPU and GPU implementations for different numbers of threads

The figure highlights that a CPU thread cannot be compared to a GPU thread. With a small number of threads, the GPU is heavily underutilized, and the performance is almost 100× worse than that of one CPU thread. Only when we launch about 500 GPU threads is the performance comparable to one CPU thread. Performance peaks when the number of threads is approximately equal to the number of parallelizable tasks (roughly 1.5M). In this case, we achieve a performance that is 20× better than eight CPU threads and almost 50× better than one CPU thread. This is good, but almost an order of magnitude short of what we might expect based on the difference in theoretical compute capacity: the A100 GPU has a compute capacity of 19.5 TFLOPS for 32-bit floats, whereas one CPU core has a compute capacity of roughly 0.05 TFLOPS, a difference of almost 400×.
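The gap between theoretical and measured performance can be made explicit with a quick calculation. The sketch below uses only the figures quoted above; the variable names are our own, and the measured speedup is rounded to 50×, so the result is an estimate rather than a precise measurement.

```python
# Figures quoted in the text (32-bit floating point)
gpu_tflops = 19.5        # A100 theoretical compute capacity
cpu_core_tflops = 0.05   # rough capacity of a single CPU core

# Measured speedup of the GPU over one CPU thread ("almost 50x", rounded)
measured_speedup = 50.0

theoretical_ratio = gpu_tflops / cpu_core_tflops
shortfall = theoretical_ratio / measured_speedup
fraction_realized = measured_speedup / theoretical_ratio

print(f"Theoretical GPU/CPU-core ratio: {theoretical_ratio:.0f}x")
print(f"Measured speedup falls short by a factor of {shortfall:.1f}")
print(f"Fraction of the theoretical ratio realized: {fraction_realized:.0%}")
```

The shortfall factor of just under 8 is what the text refers to as "almost an order of magnitude".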

When we run the calculation with more threads than tasks, execution time increases slightly compared to the minimum. While GPU threads incur only a small overhead, launching threads that have nothing to do obviously decreases performance.

Typically, we do not think of the GPU as a collection of cores, but as a single powerful processor that we aim to fully occupy. A rule of thumb and a good starting point to fully occupy the GPU is to launch a thread for each parallelizable task. The GPU is best equipped to schedule these threads efficiently on the available cores.

This analysis omits a lot of important details, since CUDA does not allow us to simply select the number of threads we want to use to execute some code. CUDA kernel execution involves additional concepts like grids and blocks. We will learn all about that in Chapter 3.
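As a small preview, the one-thread-per-task rule of thumb can be combined with blocks using plain integer arithmetic. This is a minimal sketch, not the launch code used for Figure 1.6: launch_config is a hypothetical helper, the block size of 256 threads is an assumed (though commonly used) value, and the 1024 × 1536 array shape is an assumption chosen only because it yields roughly 1.5M tasks.

```python
def launch_config(n_tasks: int, threads_per_block: int = 256) -> tuple[int, int]:
    """Return (number of blocks, total threads) covering n_tasks.

    Threads come in whole blocks, so the block count is rounded up
    and some threads in the last block may have nothing to do.
    """
    n_blocks = (n_tasks + threads_per_block - 1) // threads_per_block  # ceiling division
    return n_blocks, n_blocks * threads_per_block


# Assumed array shape (hypothetical): one task per array element
height, width = 1024, 1536
n_tasks = height * width

n_blocks, n_threads = launch_config(n_tasks)
idle_threads = n_threads - n_tasks

print(f"{n_tasks} tasks -> {n_blocks} blocks, {n_threads} threads, {idle_threads} idle")
```

When the number of tasks is not a multiple of the block size, the surplus threads in the last block must be told to do nothing, which is why GPU code usually contains a bounds check.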

The main point of this section is to demonstrate that the most reliable way to figure out how much parallelization benefits our code is to measure its effects. Even code that we believe to be "fully parallelizable" may have additional bottlenecks that limit the effectiveness of additional workers.

Answers

The non-parallelizable part of the code is 10% or 0.1. Therefore, according to Amdahl's law, the maximum speedup is 10×. If we want to reach 90% of that, we want to achieve a speedup of 9×. If we plug this into Amdahl's law again and solve for s, we need to speed up the parallelizable part 81×, so if we assume linear scaling, we need 81× the amount of compute resources. To reach 99% of the maximum speedup, we want to reach a speedup of 9.9×. This corresponds to s=891; in other words, we need 891× the amount of compute resources. In reality, we would need even more resources to account for imperfect scaling.
Memory bottlenecks, coordination/synchronization/communication between workers, and workload imbalance between workers.
No. The loop serves to sequentially update a single variable. If we distribute this over multiple threads, we will get completely non-deterministic behavior as different threads compete to try to update z.
The key is that, while the Numba-decorated function looks like a Python function, it no longer is one. Numba compiles the function to machine code, which is about as efficient as compiled C or Fortran.
No. Because the JITted function no longer runs on the Python interpreter, it is no longer executed line by line, and line profiling becomes meaningless.
GPU threads are very lightweight, and the GPU is very good at efficiently scheduling those threads and switching between them. When there is more work to be done than there are threads, we are trying to do the job of the scheduler, because we are instructing our threads to do multiple tasks that could be parallelized. CPU threads are heavyweight, and typically, there is no benefit to adding more threads than the number of cores.
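The arithmetic in the first answer can be checked with a short script. This is a sketch that assumes the usual form of Amdahl's law, speedup = 1 / ((1 - p) + p / s), where p (a symbol assumed here) is the parallelizable fraction and s is the factor by which the parallelizable part is sped up. The function names are hypothetical.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the work is sped up by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)


def required_factor(p: float, target_speedup: float) -> float:
    """Solve Amdahl's law for s, given the overall speedup we want to reach."""
    max_speedup = 1.0 / (1.0 - p)
    if target_speedup >= max_speedup:
        raise ValueError("target is at or above the maximum possible speedup")
    # 1 / target = (1 - p) + p / s  =>  s = p / (1 / target - (1 - p))
    return p / (1.0 / target_speedup - (1.0 - p))


p = 0.9                        # 10% of the code is not parallelizable
max_speedup = 1.0 / (1.0 - p)  # limit as s goes to infinity

for share in (0.90, 0.99):
    target = share * max_speedup
    s = required_factor(p, target)
    print(f"{share:.0%} of max ({target:.1f}x) needs s = {s:.0f}")
```

Under linear scaling, s is also the factor by which the compute resources must grow, which shows how expensive the last few percent of the maximum speedup are.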