Hands-On GPU Computing with Python

By Avimanyu Bandyopadhyay

About this book

GPUs are proving to be excellent general-purpose parallel computing solutions for high-performance tasks such as deep learning and scientific computing.

This book will be your guide to getting started with GPU computing. It starts by introducing GPU computing and explaining the architecture and programming models for GPUs. You will learn, by example, how to perform GPU programming with Python, and you'll look at using integrations such as PyCUDA, PyOpenCL, CuPy, and Numba with Anaconda for various tasks such as machine learning and data mining. Going further, you will get to grips with GPU workflows, management, and deployment using modern containerization solutions. Toward the end of the book, you will get familiar with the principles of distributed computing for training machine learning models and enhancing efficiency and performance.

By the end of this book, you will be able to set up a GPU ecosystem for running complex applications and data models that demand great processing capabilities, and be able to efficiently manage memory to compute your application effectively and quickly.

Publication date:
May 2019


Chapter 1. Introducing GPU Computing

Many years ago, I used to think that a graphics processing unit (GPU), more commonly known as a graphics card, was just a device dedicated to playing video games on a computer at their maximum potential. But, one day, while going through a textbook (Advanced Computer Architecture by Kai Hwang), I realized that I was unaware of a world that goes way beyond PC gaming.

Without a doubt, most consumer GPUs are manufactured to achieve those amazing graphics and visuals to enable some spell-binding gameplay. But there's a world that explores its application a whole lot further, and that is the world of GPU computing.

In this chapter, we are going to learn the basic ideas behind GPU computing, take a historical recap of computing, and trace the rise of GPU computing. We will also look at the simplicity of Python and the power of GPUs, and learn about the scope of applying GPUs in science and AI. Summaries of research work by scientists applying GPUs in different fields will help you do the same in your own.

In the final part of this chapter, we will be able to understand the social impact of GPUs by learning about more research work in fields beyond science. By the end of this chapter, you will have developed a general idea about implementing your own GPU-enabled applications, regardless of your field of study.

This chapter is divided into the following sections to facilitate the learning process: 

  • The world of GPU computing beyond PC gaming
  • Conventional CPU computing – before the advent of GPUs
  • How the gaming industry made GPU computing affordable for individuals
  • The emergence of full-fledged GPU computing
  • The simplicity of Python code and the power of GPUs – a dual advantage
  • How GPUs empower science and AI in current times
  • The social impact of GPUs



The world of GPU computing beyond PC gaming

If you are a PC gamer, you must be very familiar with the world of graphics cards. Depending on their specifications, you might also be familiar with how each of them would affect your gaming experience. Let's explore extensively what lies beyond that domain through the subsequent sections in this chapter.

What is a GPU?

A GPU, as the initialism suggests, is an electronic circuit that serves as a processor for handling graphical information to output on a display. The scope of this book is to go beyond just handling graphical information and step into the general-purpose computing with GPUs (GPGPU) arena. GPGPU is all about using GPUs for computations that are typically performed on central processing units (CPUs), which we are going to discuss in detail in the next section. The terms GPU and graphics card are used interchangeably very frequently, but the two are in fact quite different. The graphics card is a platform that serves as an interface to the GPU. Just like a CPU is seated in a motherboard socket, a GPU is seated in a socket on the graphics card (which we may think of as a mini-motherboard, but one that exists only to support the GPU and its cooling).

What about computing? The word computing is, most obviously, derived from the word compute. To compute is simply to harness your hardware to run applications through your own programmable processes. Programmable processes are sets of rules, defined by you, that are always ready to operate at your disposal. They are, of course, based on your own algorithms, which address your specific requirements, depending on the application at hand.

If you look at computing on a universal scale, you'd find that the specific requirement that we speak of in the previous paragraph isn't just limited to computer science. Computing can be inferred as a technique to calculate any measurable entity that can belong to any field, be it the field of science or even art. Now that we have described the terms GPU and computing individually, let's go ahead with an introduction to our primary topic: GPU computing.

As we can comprehend by now, GPU computing is all about the use of a GPGPU with program code that executes on GPUs. When a GPU programmer writes a GPU program, the primary motive is to hand over certain workloads that would be computationally much more intensive for a CPU to handle.


Within the code, the CPU is instructed to hand over particular operations to the GPU, which then computes them. When these computations are done, the GPU sends the results back to the CPU, which displays the output to you. Since the results are computed many times faster, such work is also called GPU-accelerated computing.


Conventional CPU computing – before the advent of GPUs

Before GPUs arrived, general-purpose computing, as we know it, was only possible with CPUs, which were the first mainstream processors manufactured for both consumers as well as advanced computing enthusiasts.

Both computational and graphical processing were handled only by them. This meant that processing the computation of input and showing the corresponding computed output on a display were both handled by the CPU.

The history of general-purpose computing goes way back to the 1950s, before GPUs arrived and revolutionized the concept. The 1970s witnessed the rise of a new era, when the first commercial microprocessor, the Intel 4004, was released by Intel in 1971. AMD followed in the same decade with the launch of the Am2900 family in 1975. There was no looking back, and a new cycle of CPU manufacturing came into effect, bringing a new range of microprocessors with every generation.

Though Intel and AMD are the popular competitors in the CPU sector, there are other manufacturers as well, such as Motorola, IBM, and many others. Qualcomm and MediaTek, in particular, dominate the mobile industry.

Since this book is going to be about GPU computing with Python, let's briefly look back at how CPU computing evolved before Python had any GPU implementations. If we want to understand the computing power of CPUs, we have to look into how modern CPUs evolved before GPU computing was heard of or deployed.

Since the inception of third-generation integrated circuits (ICs) and microprocessors, the question has always been how much power you can pack into a single chip to get the maximum performance out of it. In the early 60s, a chip contained just tens of transistors, but that number rose to tens of thousands during the 70s and hundreds of thousands in the 80s, while today's chips contain billions.


This is why CPUs are evolving continuously. During this time, both Intel and AMD invented new technologies to improve CPU design. Being in the same field, they entered a 10-year agreement in 1981 to enable mutual technology exchange. Dual-core processors, the Core 2 Duo, and many other technologies became popular.

But, eventually, a time arrived when the need for a device to accelerate general-purpose computing beyond what CPUs could deliver was acknowledged. That's when GPGPUs entered the arena, multiplying the processing power available to general-purpose workloads many times over.


How the gaming industry made GPU computing affordable for individuals

Gaming is now an industry worth over $100 billion. But way back in the 1950s, video games were made purely for academic purposes. Video games were a medium to demonstrate the capabilities of newly invented technology, and they were also a good way to test AI through games such as tic-tac-toe or chess. But access to such platforms was still limited to computer lab environments.


Spacewar! became the first purpose-built computer game in 1962.

By the 1970s, the gaming landscape started to change, and arcade gaming became very popular. PC gaming took proper shape in the 80s, with programmable home computers in many households running popular games such as Super Mario Bros., Donkey Kong, Prince of Persia, and more.

The 90s saw the emergence of legendary games such as Doom and Quake, which radically changed the PC gaming scenario. Many PC enthusiasts and gamers developed an immense interest in understanding the benefit of PC hardware customization. Such options to customize PC hardware grabbed the attention of many to enable smooth gameplay and the best possible visuals at that time.

During this time, the console market also took off, a trend that continued through the 00s, with branded hardware shipped as a single unit. Many became curious about the specifications of these devices to learn their full potential, and even today, when a new console arrives on the market, it is very common to debate the GPU that lies inside.


By 2016, there were over 2 billion gamers, and half of them lived in the Asia-Pacific region. As we can see, the rise of the gaming industry is known to many, and a graphics card is a necessary requirement to get the most out of games that can deliver some amazing visual experiences.

Integrated graphics, as seen in many Intel systems, could not keep up with game developers' or players' requirements. So, as the gaming industry took off massively, GPUs became much cheaper, enabling an affordable market for GPGPUs; previously, when PC gaming wasn't so popular, they were very expensive. Computer scientists began to tap into this fantastic resource, and so began an incredible adventure in the field of accelerated science.


The NVIDIA GeForce 3 Ti 200 was the first ever programmable GPU.

One of the early significant breakthroughs in applying GPGPUs to scientific computing was the use of 3,000 Tesla GPUs to determine the chemical structure of the HIV protein, in order to create better drugs for battling the virus, which affects millions. The same effort on a CPU-only computing model would have required a system five times larger to pursue the same objective.

Due to the huge demand for PC gaming hardware and the GPU being a prime component for amazing visuals, the graphics card became a quintessential element of every PC used for video games.

Also, many individuals in the community of technology enthusiasts were not just gamers; quite a number of them were programmers as well. So, that's when the magic happened, creating a new community for GPU programming.


The emergence of full-fledged GPU computing

From the first GPUs to the most powerful GPUs seen today, GPUs continue to make a noticeable mark upon society with limitless applications, as we are going to see in The social impact of GPUs section of this chapter. For now, let's look into how GPU specifications have evolved since GPUs became available at much-reduced cost with the rise of the gaming industry.

GPU computing has grown massively in the last two decades with the creation of GPU application programming interfaces (APIs) such as Compute Unified Device Architecture (CUDA) and OpenCL. These APIs allow the programmer to harness the parallel computational elements within the GPU.

Let's compare these two APIs:

  • CUDA has been written specifically for the NVIDIA GPU architecture, whereas OpenCL is not architecture-specific and is better described as a computing standard; you can write OpenCL code for both NVIDIA and AMD GPUs.
  • CUDA is a proprietary API from NVIDIA, available specifically for their GPUs, so CUDA programs can only be developed for NVIDIA hardware. OpenCL, the Open Computing Language as the name suggests, is not limited to any particular platform; in other words, OpenCL is cross-platform in nature, and you can develop OpenCL programs for GPUs from any manufacturer, be it NVIDIA, AMD, or another vendor.
  • CUDA was first released in 2007, starting with version 1.0. OpenCL 1.0 was originally proposed by Apple and released by the Khronos Group in 2008, which has carried out its development since.
  • Over the years, there have been several releases of each API: CUDA version 10.0 is the most recent release, from September 19, 2018, while the most recent OpenCL version is 2.2-8, released on October 8, 2018.

As we can see from the comparison, CUDA is now a leading API in the field of GPU programming. Some of the libraries available with CUDA are useful for linear algebra, fast Fourier transforms, random number generation, and many other computational implementations.

The basic idea of a CUDA or OpenCL operation works with three steps:

  1. Transfer data (meant for intensive computation) from the main memory to the GPU
  2. Use the CPU to invoke the GPU kernel for computing on that data
  3. After the results are computed, they are transferred back to the main memory from the GPU memory
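The three steps above can be sketched in Python. The snippet below uses CuPy, whose array API mirrors NumPy's, when a CUDA GPU is available, and falls back to NumPy purely so that the sketch runs on machines without one; the fallback is an illustration device here, not part of the GPU workflow itself:

```python
import numpy as np

try:
    import cupy as xp              # arrays created with xp live in GPU memory

    def to_host(a):
        return xp.asnumpy(a)       # copy results from GPU memory back to host
except ImportError:
    xp = np                        # CPU-only fallback so the sketch still runs

    def to_host(a):
        return a

data = np.arange(1_000_000, dtype=np.float32)  # data in main (host) memory

dev = xp.asarray(data)         # 1. transfer data from main memory to the GPU
result = xp.sqrt(dev) * 2.0    # 2. the CPU invokes GPU kernels on that data
out = to_host(result)          # 3. results come back from GPU memory to main memory

print(out[:3])
```

With CuPy installed, each `xp.` call launches work on the GPU, and only the final `to_host` transfer brings the results back, which is exactly the pattern the three steps describe.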

Since its inception, CUDA has received great academic support and is backed by NVIDIA. As GPUs become better and better every year, new technologies and libraries keep evolving to maximize the throughput of this process at minimum latency. NVIDIA GPUs are classified into different groups based on their compute capability. Starting from 1.0, the most recent compute capability number is 7.5, corresponding to the recent GPUs based on the Turing architecture at the time of writing this book.


The different architectures of NVIDIA GPUs, released over the years in chronological order, are Tesla, Fermi, Kepler, Maxwell, Pascal, Volta, and the most recent one, Turing. The different architectures of AMD GPUs (previously ATI), again in chronological order, are the R series of cards (R100 through R600, R670, and R700), followed by the Islands series: Northern Islands, Southern Islands, Sea Islands, Volcanic Islands, and Arctic Islands. The most recent architecture, introduced in 2017, is called Vega.

In 2016, AMD released a software suite called GPUOpen that can be used to create GPU computing applications for general-purpose usage. GPUOpen is not proprietary but entirely open source.

The rise of AI and the need for GPUs

CUDA paved the way to the development of libraries such as cuDNN for deep neural networks, which are essential in the field of deep learning as a new approach within machine learning and as a part of AI programming through GPUs. The cuDNN library can be installed and invoked within your developmental code.

AI studies began in the 1950s and led to the creation of machine learning. Machine learning has now evolved toward deep learning, which uses neural network implementations to train AI algorithms on large datasets. Over the years, several machine learning and deep learning libraries have been created and are under active development, such as TensorFlow, Keras, Theano, and many others.

Going a step further, new libraries dedicated to specific scientific studies have been built on top of such machine learning libraries. Deepchem, developed with machine learning libraries, is an excellent example, which we will study in Chapter 12, GPU Acceleration for Scientific Applications using Deepchem.

The creation of such dedicated libraries has made it easier for scientists to think more about developing useful applications rather than creating libraries from scratch. These efforts matter most for scientists from backgrounds such as biology who want to deploy applications as quickly as possible with a simplified understanding of code. Such libraries also make collaboration between computer scientists and bioinformaticians much more convenient.

So, why are GPUs needed for deep learning, machine learning, or AI? Before we try to understand the reason, it is important to know briefly what big data is. Big data, as the term suggests, involves data mining and data warehousing on enormous amounts of data. As an example, let's take genomics: in the field of modern genomics, a single run of DNA sequencing can generate up to 1 TB of data.



Computational data is not restricted to any particular field; beyond science, think of geographical data and the datasets a world map can generate. Historical studies based on archaeological datasets can also be quite large. Consider the example of the digital restoration of a priceless painting or artifact: the imagery data can be huge, but it is a perfect fit for GPUs, which carry large amounts of memory. Recent NVIDIA GPUs have up to 32 GB of video memory!

GPU computing can speed up a system's ability to process such big data. When AI algorithms train on datasets (as is the regular process in deep learning), bringing in GPUs can reduce training time enormously, and this is why the use of GPUs in machine learning can be so crucial.

Today's GPUs have thousands of cores, enabling performance many times faster than CPUs on parallel workloads. For the same reason, many machine learning developers now maintain separate GPU versions of their libraries.

As AI acceleration is now a new computing model and the era of AI has just arrived, GPUs are the perfect choice for students, professionals, and scientists to learn or implement AI for their research.

The Volta architecture, which we just mentioned, is an AI-focused architecture from NVIDIA. GPUs belonging to the Volta micro-architecture contain tensor cores, which are designed especially for deep learning. The introduction of these tensor cores enables a huge increase in the throughput and efficiency of deep learning applications; this new technology makes Volta GPUs perform training and inference tasks up to three times faster than Pascal-architecture GPUs. The NVIDIA Tesla V100, for example, is a Volta GPU and contains 640 tensor cores that can operate in parallel. Tesla is also the name of an earlier micro-architecture from NVIDIA, which is no longer used for new GPUs; NVIDIA still uses it as a brand name, as a tribute to Nikola Tesla, a legend in the world of science.

Google is also quite ahead in the AI race with its own line of tensor processing units (TPUs). These TPUs are specifically built as application-specific integrated circuits (ASICs) to accelerate neural network-based machine learning. As these TPUs are extremely powerful, liquid cooling became a necessary step to get the best and optimum performance out of these AI accelerators. There have been three generations of TPUs to the present day. The most recent and third-generation TPU (3.0) can perform machine learning computations of up to 100 petaflops.



The simplicity of Python code and the power of GPUs – a dual advantage

Python is a programming language with syntax that is very easy to grasp and understand, especially for computational analysts from backgrounds other than computer science. For this reason, it has been adopted wholeheartedly throughout the world's research community. Combined with the powerful computational capabilities of GPUs, the simplicity of Python syntax offers a clear dual advantage.

The C language – a short prologue

The C language was created as a procedural and structured programming language, developed between 1969 and 1973 at AT&T Bell Labs by Dennis M. Ritchie. As the legendary programming language gave birth to many well-known and popular programming languages, software, and operating systems, we can never forget the contributions of the C language to the developer community.

The heart of Linux-based OSes, the Linux kernel, was originally written in C in 1991 by Linus Torvalds. This universal platform is found in almost every device, big or small. Without C, perhaps Linux would never have come into existence; there wouldn't be any Android phones either, as Android is completely Linux-based. Nor would there be C++, and the contribution of these two languages to the development world will always hold paramount importance.

From C to Python

Python is one of the greatest programming languages ever built. It was created by Guido van Rossum and first released in 1991. Today, Python is used exhaustively in numerous fields. CPython, itself written in C, is the most popular interpreter and the reference implementation of Python. Many programming idioms carry over naturally from C, so if you are migrating from programming in C to Python, you can make the transition quite quickly.

Even if Python is not your first programming language, it's a great way to get introduced to the programming world, and it can be an ideal first language for someone learning programming for the first time.



Python comes installed by default on most Linux-based distributions. The present community of AI developers prefers Python over other languages for machine learning; TensorFlow, scikit-learn, Theano, and Keras are some good examples of libraries built for it.

The simplicity of Python as a programming language – why many researchers and scientists prefer it

Python is loved by many for developing applications and utilities due to its simple, easy-to-understand syntax. A novice user will not take long to get a basic idea of what a few lines of Python code are meant to deliver. For example, in C, we pull in other header files with the #include directive, whereas in Python, we can simply use an import statement; the latter is clearly easier to understand. Furthermore, the following example of printing a simple text in both languages illustrates the same:

The following is a program to print Hello Reader! in C:

#include <stdio.h>

int main() {
    printf("Hello Reader!\n");
    return 0;
}
The following is the equivalent program in Python 3:

print('Hello Reader!')

As we can see, the C example for displaying Hello Reader! takes several lines, while the same can be done with just a single line in Python.

Due to the extreme simplicity of Python syntax, many researchers and scientists who come from a non-CS background find it much easier to get started with programming in Python.

You can also use Python to develop your own programming language.

The power of GPUs

GPUs have come a long way from when they were used just for graphics applications. For decades now, their significance continues to gain attention in limitless fields of applications due to their unique advantages over traditional CPUs.


Empowered with thousands of cores, today's GPUs continue to be tapped into by both academia and industry in order to achieve amazing levels of parallelism in their research-focused applications. Massive CPU clusters can now be replaced with a few or, in some cases, just a single GPU server to deliver the same level of productivity.

GPUs have made it possible to create the next level of supercomputers to accelerate research in diverse fields. Big data can now be computed upon efficiently with GPU parallelism. The latest NVIDIA GPU architecture is Turing, as we mentioned earlier; some of its noteworthy features are described in the following sections.

With the advent of the era of AI, GPUs now are needed more than ever to handle the training of enormous datasets with various deep learning techniques. And looking at the progress in this landscape so far, it is evident that we are on the right track. 

Ray tracing

In simple terms, ray tracing, as the term suggests, is the process of computing (tracing) the pathways of rays from light sources to simulate their behavior on different objects within an environment. Ray tracing thus allows lighting and shadows to be rendered far more faithfully while simulating a 3D environment with a graphics engine.

NVIDIA's latest Turing GPUs introduced this feature with their RTX series of cards, which, when enabled, allows virtual simulation in modern video games and graphics to become many times more realistic. There are also many older technologies employed through graphics engines, such as HairWorks, PhysX, and TressFX, the last of the three being from AMD. The top RTX GPUs have 72 Turing RT cores and can deliver up to 11 gigarays per second of real-time ray tracing performance.

Artificial intelligence (AI)

Simply put, AI is the process of mimicking human intelligence and behavior. An AI system is said to have passed the Turing test when it convinces a human that it itself is not AI but just another human. There are a plethora of fields where AI is used and applied. Some of these fields are science, health, commerce, and transport. All of these fields and more use AI for effective precision in carrying out computational operations on datasets involving loads of data.



NVIDIA's new RTX line of GPUs is also focused on accelerated AI.

With 576 multi-precision Turing tensor cores, the flagship of the line can provide up to 130 teraflops of deep learning performance, along with 24 GB of high-speed GDDR6 memory at 672 GB/s of bandwidth (twice the memory of previous-generation Titan GPUs), which is ideal for larger models and datasets.
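As a quick sanity check, the 672 GB/s figure follows from multiplying the per-pin data rate by the bus width in bytes. The 384-bit bus width used below is an assumption on my part, since it is not stated in the text:

```python
def memory_bandwidth_gbs(data_rate_gbps_per_pin: float, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/s: per-pin data rate (Gbps) times bus width in bytes."""
    return data_rate_gbps_per_pin * bus_width_bits / 8

# 14 Gbps GDDR6 on an assumed 384-bit bus:
print(memory_bandwidth_gbs(14, 384))  # → 672.0
```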

Programmable shading

Shading is the process of varying darkness at different levels to convey depth on a 3D object, based on the intensity of light falling on it. Programmable shading is the method of using algorithms to compute such shading, simulating the shadows of objects in a 3D environment at realistic levels.
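As a toy illustration of computing shading from the intensity of light falling on a surface, here is the classic Lambertian (diffuse) model, the simplest rule that real programmable shaders build on; choosing this model is my own assumption for illustration, not a description of NVIDIA's implementation:

```python
import math

def lambert_shade(normal, light_dir):
    """Diffuse intensity: cosine of the angle between the surface normal and
    the direction to the light, clamped to zero for surfaces facing away."""
    def normalize(v):
        m = math.sqrt(sum(c * c for c in v))
        return tuple(c / m for c in v)
    n, l = normalize(normal), normalize(light_dir)
    return max(0.0, sum(a * b for a, b in zip(n, l)))

print(lambert_shade((0, 0, 1), (0, 0, 1)))  # light head-on: full brightness, 1.0
print(lambert_shade((0, 0, 1), (1, 0, 0)))  # light perpendicular to the surface: 0.0
```

A real shader evaluates a rule like this for every pixel, which is exactly the kind of massively parallel, per-element work GPUs are built for.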

NVIDIA's new RTX GPUs enable advanced programmable shading by taking it further ahead with variable-rate shading, texture-space shading, and multi-view rendering.


RTX-OPS is a new performance metric from NVIDIA to calculate the compute power of each of the RTX lines of GPUs. It is based on the following formula:

Tera RTX-OPS = Peak FP32 performance (in TFLOPS) × 80%
             + Peak INT32 performance (in TFLOPS) × 28%
             + Peak ray tracing performance (in Tera-OPS) × 40%
             + Tensor core performance (in TFLOPS) × 20%
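The weighting can be written as a small helper function; the percentages come from the formula above, while the sample peak numbers passed in are purely hypothetical:

```python
def tera_rtx_ops(fp32_tflops, int32_tflops, rt_tera_ops, tensor_tflops):
    """RTX-OPS weighting: each unit's peak rate scaled by the fraction of a
    frame NVIDIA assumes it is busy (80%, 28%, 40%, and 20% respectively)."""
    return (fp32_tflops * 0.80
            + int32_tflops * 0.28
            + rt_tera_ops * 0.40
            + tensor_tflops * 0.20)

# Hypothetical peak numbers, for illustration only:
print(tera_rtx_ops(10.0, 10.0, 10.0, 100.0))  # ≈ 34.8 Tera RTX-OPS
```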


Latest GPUs at the time of writing this book (can be subject to change)

Founders Editions are manufactured entirely by NVIDIA: the card body, printed circuit board (PCB), cooling technology, and, of course, the GPU itself. For non-Founders Editions, only the GPU is manufactured by NVIDIA; the card body, PCB, and cooling are all manufactured by aftermarket partners.

The latest NVIDIA GeForce RTX GPUs and their specifications in brief are as follows.

NVIDIA GeForce RTX 2070

This is the entry-level GPU in the RTX series, with 42 T RTX-OPS, a 1,620 MHz boost clock speed, an 8 GB GDDR6 frame buffer, and 14 Gbps of memory speed. Its Founders Edition comes factory overclocked at 1,710 MHz with 45 T RTX-OPS.

NVIDIA GeForce RTX 2080

This is the mid-range GPU in the RTX series, with 57 T RTX-OPS, a 1,710 MHz boost clock speed, an 8 GB GDDR6 frame buffer, and 14 Gbps of memory speed. Its Founders Edition comes factory overclocked at 1,800 MHz with 60 T RTX-OPS.

NVIDIA GeForce RTX 2080 Ti

This is the high-end GPU in the RTX series, with 76 T RTX-OPS, a 1,545 MHz boost clock speed, an 11 GB GDDR6 frame buffer, and 14 Gbps of memory speed. Its Founders Edition comes factory overclocked at 1,635 MHz with 78 T RTX-OPS.


NVIDIA Titan RTX

This is the most recent Titan GPU, launched through the RTX series, with 84 T RTX-OPS, a 1,770 MHz boost clock speed, a 24 GB GDDR6 frame buffer, and 14 Gbps of memory speed.

The latest AMD Radeon RX Vega GPUs and their specifications in brief are as follows.



Radeon RX Vega 56

The Radeon RX Vega 56 GPU has a 1,471 MHz boost clock speed, an 8 GB HBM2 frame buffer, and 1.6 Gbps of memory speed.

Radeon RX Vega 64

The Radeon RX Vega 64 GPU has a 1,546 MHz boost clock speed, an 8 GB HBM2 frame buffer, and 1.89 Gbps of memory speed.

Radeon VII

The Radeon VII GPU has up to a 1,750 MHz boost clock speed, a 16 GB HBM2 frame buffer, and 4 Gbps of memory speed.

Significance of FP64 in GPU computing

FP32 is a single-precision floating-point format requiring 32 bits of memory. Similarly, FP64 is a double-precision floating-point format requiring 64 bits. On GPUs designed for scientific computing, FP64 units can theoretically run at half the throughput of FP32, a 1:2 FP64:FP32 ratio.

Because FP64 computes with double the number of bits of FP32, many computational scientists prefer GPUs with the best FP64:FP32 ratio, ideally 1:2. Applications that model and simulate physical environments with extreme precision and high-accuracy computations will always need double-precision accuracy at high performance.

Therefore, FP64-capable GPUs are quite significant and absolutely needed for such purposes. But for the non-computational processing of imagery or statistical data, FP32 is sufficient. The RTX line of cards offers only minimal FP64 throughput; the NVIDIA Titan V, based on the Volta architecture, is the latest NVIDIA GPU with fast FP64 support.
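The gap between the two precisions is easy to demonstrate from Python with NumPy: FP32 keeps roughly 7 significant decimal digits and FP64 roughly 15 to 16, so a small increment that FP64 preserves simply vanishes in FP32:

```python
import numpy as np

big = 1.0e8  # nine significant digits

# In FP32, adding 1 to 1e8 is lost to rounding (only ~7 significant digits):
single = np.float32(big) + np.float32(1.0)
print(single == np.float32(big))  # True: the increment vanished

# In FP64, the same addition is represented exactly (~15-16 digits):
double = np.float64(big) + np.float64(1.0)
print(double == big + 1.0)        # True: the increment survived
```

This is exactly why simulations that accumulate many tiny contributions need FP64 hardware, while image and statistical processing usually gets away with FP32.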

The dual advantage – Python and GPUs, a powerful combination

The simplicity of Python code syntax enables the user to develop GPU applications with ease. The user might not be from a computer science background, but that does not stop them from contributing to research and development effectively and efficiently.


An important thing to note here is the value of an open source approach when developing Python-based GPU applications, because only such a model builds transparency in the developer/researcher community and earns users' trust in applications meant to benefit humanity, in any field.


How GPUs empower science and AI in current times

NVIDIA RAPIDS is a very recent example of using an open source system to carry out research related to science, AI, and other fields. There are numerous examples of research work in science that has been empowered by GPU acceleration. Let's understand its significance through some of these amazing research examples.

Bioinformatics workflow management

The following research paper is discussed in this section: Managing Complex Workflows in Bioinformatics: An Interactive Toolkit With GPU Acceleration, A Welivita, I Perera, D Meedeniya, A Wickramarachchi, V Mallawaarachchi (2018), IEEE Transactions on NanoBioscience, 17(3), 199-208, doi:10.1109/tnb.2018.2837122.

BioWorkflow is a GPU-accelerated workflow management system based on the Amazon cloud. The application uses three levels of parallelism:

  • Parallel execution inside workflow nodes
  • Parallel execution among workflow nodes
  • Concurrent execution of different instances of the workflow with different user inputs

Amazon cloud-enabled, GPU-accelerated computing achieved the preceding three levels of parallelism, which helped reduce workflow execution times, with speedups of approximately two to three times.


Magnetic Resonance Imaging (MRI) reconstruction techniques

The following research paper is discussed in this section: A survey of GPU-based acceleration techniques in MRI reconstructions, H Wang, H Peng, Y Chang, D Liang (2018), Quantitative Imaging in Medicine and Surgery, 8(2), 196-208, doi:10.21037/qims.2018.03.07.

GPU programming has brought easier-to-use libraries and frameworks to programmers. The remarkable role of GPUs has been noted in medical imaging, image reconstruction, and image analysis in clinical applications. Despite this, many challenges still lie ahead.

GPU parallel architectures require a pipeline redesign of reconstruction algorithms. Optimizing algorithms for GPU computing up front can bring huge improvements, even over the GPU-based libraries that are commonly used to deploy GPUs.

Digital-signal processing for communication receivers

The following research paper is discussed in this section: GPU Acceleration of DSP for Communication Receivers, Gunther J, Gunther H, Moon T, Proceedings of the GNU Radio Conference, 2(1), 2017, Pmc: pmc5695887.

In this project, digital-signal processing algorithms were implemented with NVIDIA's CUDA API to harness the computational power of GPUs for processing large amounts of bandwidth in real time. On Ubuntu 16.04 with a GTX 1080 Ti, 20 MHz of bandwidth was processed in real time.

Their ultimate goal was to implement the interference cancellation and demodulation functions on GPU hardware and achieve faster-than-real-time execution. They also noted that the demand for higher data rates drives the need for acceleration even further.

Specifically, in separate sections of their paper, they describe how traditional CPUs are composed of a small number of cores (four to eight) surrounded by large cache memories, and compare them to GPUs, which are composed of hundreds of cores (100-1,000 or more) and can support thousands of threads simultaneously. They highlight that CPUs are limited to running only a few software threads at a time.

The advantage of GPUs over field programmable gate arrays (FPGAs), which are also often used to accelerate digital-signal processing, is highlighted as well: GPUs can be programmed using well-known extensions of the C language, such as CUDA and OpenCL.


Studies on the brain – neuroscience research

The following research paper is discussed in this section: BrainFrame: A node-level heterogeneous accelerator platform for neuron simulations, G Smaragdos, G Chatzikonstantis, R Kukreja, H Sidiropoulos, D Rodopoulos, I  Sourdis, C Strydis (2017), Journal of Neural Engineering, 14(6), 066008, doi:10.1088/1741-2552/aa7fc5.

BrainFrame is a heterogeneous acceleration platform that serves computational neuroscience studies by enabling the large volume of experimentation that is often required for understanding how the brain works.

The scientists analyzed biophysically accurate neuron models, as such models are considered essential for the deeper understanding of biological brain networks. The BrainFrame system addresses the need for convenient programming and the computational requirements of the field. They made use of a high performance computing (HPC) platform that integrates three accelerator technologies, namely the following:

  • An Intel Xeon Phi CPU
  • A Maxeler Dataflow Engine
  • An NVIDIA GPU

A Python package for the simulator-independent specification of neuronal network models called PyNN has been used in this project. The PyNN frontend allows the heterogeneous platform to be immediately accessible to a multitude of prior modeling works, which is essential for the wide adoption of complex HPC platforms in the neuroscientific community.

Large-scale molecular dynamics simulations

The following research paper is discussed in this section: Graphics Processing Unit Acceleration and Parallelization of GENESIS for Large-Scale Molecular Dynamics Simulations, J Jung, A Naurse, C Kobayashi, Y Sugita (2016), Journal of Chemical Theory and Computation, 12(10), 4947-4958, doi:10.1021/acs.jctc.6b00241.

Molecular dynamics (MD) in the scientific community can play an exceptional role in the discovery of new drugs to address a challenging disease. GPUs can accelerate molecular dynamics simulations to enable better productivity in drug discovery research. This paper is one of the many examples of GPU-accelerated molecular dynamics.



The researchers developed a parallelization scheme for all-atom MD simulations suitable for hybrid (CPU+GPU) processors across multiple nodes. The time-consuming real-space non-bonded interactions are calculated on GPUs, while the other parts are computed on CPUs. Accelerating the GPU side frees the CPUs to spend the simulation time on the reciprocal-space and remaining calculations.

Two NVIDIA Tesla K40 GPUs accelerated the overall speed of the MD simulations while maintaining good parallel efficiency. This development could be helpful for long MD simulations of large systems on massively parallel computers equipped with GPUs.

GPU-powered AI and self-driving cars

Self-driving cars have become quite the buzz phrase. GPUs are now used extensively to power the AI that makes crucial decisions during travel. But this sector demands ever greater accuracy and reliability. GPUs heavily influence faster training on huge datasets for efficient deep learning.

Research work posited by AI scientists

Artificial intelligence has been the talk of the town and has garnered immense attention in recent years. As a result, much research surrounding AI has redefined the way we perceive technology. Let's talk about some examples from recent AI research where GPUs have been used extensively.

Deep learning on commodity Android devices

The following research paper is discussed in this section: RSTensorFlow: GPU Enabled TensorFlow for Deep Learning on Commodity Android Devices, M Alzantot, Y Wang, Z Ren, M B Srivastava (2017), Proceedings of the First International Workshop on Deep Learning for Mobile Systems and Applications, EMDL 17, doi:10.1145/3089801.3089805.

This is an interesting project on mobile GPUs in which AI scientists integrated an acceleration framework tightly into TensorFlow (as we mentioned earlier) to make good use of the heterogeneous computing resources on mobile devices without the need for any extra tools. They evaluated their system on different Android phone models to study the trade-offs of running different neural network operations on the GPU.


They also compared the performance of running different model architectures, such as convolutional and recurrent neural networks, on the CPU alone versus on heterogeneous computing resources (CPU and GPU). Their results reveal that phone GPUs are capable of offering a substantial performance gain in matrix multiplication on mobile devices. Models that involve the multiplication of large matrices can therefore run approximately three times faster with GPU support.
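Matrix multiplication is the workload in question because every output element can be computed independently, which is exactly what a GPU exploits. A naive pure-Python version (for illustration only; real mobile frameworks dispatch this to optimized GPU kernels) shows how the per-element work factors out:

```python
def matmul(a, b):
    # c[i][j] = sum over k of a[i][k] * b[k][j]; every c[i][j] is
    # independent, so a GPU can assign one thread per output element.
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # → [[19, 22], [43, 50]]
```

For an n x n result there are n squared such independent sums, which is why the speedup grows with matrix size.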

Motif discovery with deep learning

The following is the research paper discussed in this section: YAMDA: thousandfold speedup of EM-based motif discovery using deep learning libraries and GPU, D Quang, Y Guan, S C Parker (2018), Bioinformatics, 34(20), 3578-3580, doi:10.1093/bioinformatics/bty396.

A motif is a genetic sequence or structure that can be of biological significance, and is thereby a path toward understanding genetics one step further.

YAMDA is a Python-based, GPU-enabled deep learning tool for motif discovery in large biopolymer sequence datasets, a task that can be computationally demanding and presents significant challenges for discovery in omics research. Omics refers to the study of fields such as genomics, proteomics, and metabolomics.

The paper addresses the challenges of MEME, one of the most popular motif discovery software tools, highlighting its excessively long runtimes on large datasets. YAMDA takes care of this challenge as a highly scalable motif discovery package built on PyTorch, a deep learning library with strong GPU acceleration that is highly optimized for the tensor operations motif discovery also relies on.

YAMDA accurately does the same job as MEME but completes execution in seconds or minutes, which translates to speedups of over a thousandfold! Notice the connection between two different fields (science and AI) here!

Structural biology meets data science

The following is the research paper discussed in this section: Structural biology meets data science: Does anything change?, C Mura, E J Draizen, P E Bourne, (2018) Current Opinion in Structural Biology, 52, 95-102, doi:10.1016/j.sbi.2018.09.003.

Data science revolves around datasets, and handling datasets conveniently requires efficient machine learning tools. Machine learning tools, in turn, can rely on powerful GPUs to manage enormous datasets.


This research paper discusses the significance of both GPU computing and data science in advancing the field of structural biology. The researchers particularly mention the use of deep learning libraries in drug discovery, while also noting the exceptional computational performance of modern GPU-equipped clusters.

Heart-rate estimation on modern wearable devices

The following is the research paper discussed in this section: Unsupervised heart-rate estimation in wearables with liquid states and a probabilistic readout, A Das, P Pradhapan, W Groenendaal, P Adiraju, R T Rajan, F Catthoor, C V Hoof (2018), Neural Networks, 99, 134-147, doi:10.1016/j.neunet.2017.12.015.

Heart-rate monitoring can be crucial, especially for diabetic and cardiac patients. The scientists used CARLsim, a GPU-accelerated library for simulating spiking neural network models with a high degree of biological detail, to create novel learning techniques. This enables an end-to-end approach for estimating heart rate on wearable devices with embedded neuromorphic hardware, setting a benchmark for the neuromorphic community.

Neuromorphic computing is essentially an engineering approach to emulating the activity of the biological brain; in this case, translating ECG signals into a meaningful form.
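The spiking neuron models that CARLsim simulates differ from conventional ANN units in that they integrate input over time and fire discrete spikes. A minimal leaky integrate-and-fire neuron (my own illustration, not CARLsim's API) captures the idea:

```python
def lif_spikes(currents, tau=20.0, v_thresh=1.0, v_reset=0.0, dt=1.0):
    # Leaky integrate-and-fire: the membrane potential v leaks toward
    # zero with time constant tau, integrates the input current, and
    # emits a spike (recording the time step) on crossing the threshold.
    v, spikes = 0.0, []
    for t, i in enumerate(currents):
        v += dt * (-v / tau + i)
        if v >= v_thresh:
            spikes.append(t)
            v = v_reset
    return spikes

print(lif_spikes([0.3] * 10))  # → [3, 7]
```

CARLsim's contribution is running millions of such neurons, with far richer dynamics, in parallel on a GPU.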

Drug target discovery

The following is the research paper discussed in this section: DeepSite: Protein-binding site predictor using 3D-convolutional neural networks, J Jiménez, S Doerr, G Martínez-Rosell, A S Rose, G D Fabritiis (2017), Bioinformatics, 33(19), 3036-3042, doi:10.1093/bioinformatics/btx350.

Drug target discovery is the method of finding a potential site or pocket on a target protein where a small molecule can dock onto it. The larger goal of this method is to address the particular disease that the target protein is related to.

DeepSite is a protein-binding site predictor built on a novel knowledge-based approach that uses state-of-the-art 3D convolutional neural networks. The algorithm learns by example from 7,622 proteins in the scPDB database of binding sites, evaluated with both a distance-based and a volumetric-overlap criterion.



The machine learning-based method that was developed by the scientists demonstrates superior performance compared to two other competitive algorithmic strategies. Users can submit a protein structure file for pocket detection to their NVIDIA GPU-equipped servers through a WebGL graphical interface to study and find new sites on proteins for potential docking.

Deep learning for computational chemistry

The following research paper is discussed in this section: Deep learning for computational chemistry, G B Goh, N O Hodas, A Vishnu (2017), Journal of Computational Chemistry, 38(16), 1291-1307, doi:10.1002/jcc.24764.

Computational chemistry is a branch of chemistry that uses computer simulations to address chemical problems, and it is crucial in the field of medicinal studies. The researchers specifically highlight that the availability of big data, coupled with technological advances in GPU hardware, both absent in the last century, has created a revolution. This has facilitated the advent of deep neural networks (DNNs), which differ substantially from the artificial neural networks (ANNs) of the last century.

The paper stresses the effect of GPU-accelerated computing on computational chemistry and also discusses the difference between machine learning and deep learning models. One important point we cannot leave out is that the paper highlights an open source approach (to both code and documentation) to training neural networks on GPUs as a strong reason for the rapid growth of deep learning in recent years, as well as for its impact on academic research, as revealed by the growing number of deep learning-related publications since 2010.


The social impact of GPUs

Big data is everywhere, and so are datasets, which means that GPUs and AI can be applied to virtually any computational field. Since we have already discussed how GPUs contribute to science and AI, let's look at some examples from other fields that reveal more about how GPUs can contribute to our society.

We are going to discuss some diverse fields here, but, of course, there are no limits!


Archaeological restoration/reconstruction

The following research paper is discussed in this section: Automated GPU-Based Surface Morphology Reconstruction of Volume Data for Archaeology, D Jungblut, S Karl, H Mara, S Krömker, G Wittum (2012), Contributions in Mathematical and Computational Sciences Scientific Computing and Cultural Heritage, 41-49, doi:10.1007/978-3-642-28021-4_5.

GPUs can be used to reconstruct archaeological data that could be of historical significance and value. A team of researchers used a highly parallelized implementation for NVIDIA Tesla GPUs using the CUDA programming library to reconstruct archaeological datasets of several hundreds of megabytes within a few minutes of time, thanks to GPU speedup.

Rendering the generated triangular meshes in real time allows ceramics to be viewed interactively. Automated density-based segmentation isolates different areas of interest, which are used as features for archaeological classification. This revealed ancient applications of bronze scales, a characteristic technique of Este pottery from the 7th-6th century BC.


These techniques can therefore be crucial to the accurate restoration of ancient artifacts that can tell us a lot about lost cultures. One great example where such techniques could be of significant value is the preservation of the Kailash temple at the Ellora Caves in India, a heritage site holding the key to a great era in human history. GPUs can contribute a lot to the 3D reconstruction of such a priceless heritage site.

Numerical weather prediction

The following research paper is discussed in this section: GPU acceleration of numerical weather prediction, J Michalakes and M Vachharajani (2008), Parallel Processing Letters, 18(04), 531-548. doi:10.1142/s0129626408003557.

This paper highlights the emerging need for increased parallelism, rather than increased processor speed, for weather and climate prediction. Large-scale parallelism alone is described as ineffective in many scenarios where strong scaling is required.

The GPU computing approach, with its fine-grained parallelism, delivered a speedup of nearly 10 times for a computationally intensive portion of the Weather Research and Forecast (WRF) model on a variety of NVIDIA GPUs.
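The strong-scaling limitation mentioned above is captured by Amdahl's law: no matter how many processors (or GPU cores) you add, the serial fraction of the code caps the achievable speedup. A short sketch:

```python
def amdahl_speedup(parallel_fraction, n_workers):
    # Amdahl's law: the serial fraction (1 - p) is never sped up,
    # so the overall speedup saturates as n_workers grows.
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_workers)

# With 90% of the work parallelizable, even unlimited workers
# cannot push the speedup past 10x.
for n in (10, 100, 10000):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

This is why fine-grained GPU parallelism, which raises the parallel fraction itself, can outperform simply adding more nodes.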


Composing music

The following research paper is discussed in this section: Machine Learning Research that Matters for Music Creation: A Case Study, B L Sturm, O Ben-Tal, U Monaghan, N Collins, D Herremans, E Chew, F Pachet (2018), Journal of New Music Research, 48(1), 36-55, doi:10.1080/09298215.2018.1515233.

Artificial Intelligence Virtual Artist (AIVA) is a revolutionary AI creation that can compose beautiful music. The deep neural network has been taught the art of music composition by reading through a large database of classical scores written by the most famous composers (Bach, Beethoven, Mozart, and others). Just by studying existing musical compositions, AIVA can capture the concepts of music theory and create new compositions.

AIVA is powered by CUDA on NVIDIA Titan X (Pascal) GPUs and cuDNN, a GPU-accelerated CUDA library for deep neural networks.

Real-time segmentation of sports players

The following research paper is discussed in this section: Real-time GPU color-based segmentation of football players, M A Laborda, E F Moreno, J M Rincón, and J E Jaraba (2011), Journal of Real-Time Image Processing, 7(4), 267-279, doi:10.1007/s11554-011-0194-9.

GPUs can be used in different kinds of sports. Here, we share an example from the game of football. Please note that we aren't talking about the popular EA FIFA video game here, but about actual football players!

The objective of this project was to operate in real time in the uncontrolled environment of a sporting event, such as a live football match. A Gaussian mixture model (GMM) was applied as the segmentation paradigm to identify football players, whose appearance is composed of diverse and complex color patterns, by analyzing live football images and video. An image demonstrating the real-time segmentation is available in the referenced paper (Figure 5).
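To sketch what GMM segmentation computes per pixel, here is a hand-rolled 1-D illustration (my own, not the authors' implementation): each color value gets a responsibility for each mixture component, and the pixel is labeled with the most responsible one.

```python
import math

def gmm_responsibilities(x, weights, means, stds):
    # Posterior probability (responsibility) of each 1-D Gaussian
    # component for observation x; per-pixel, hence GPU-friendly.
    def gauss(x, mu, sigma):
        return (math.exp(-0.5 * ((x - mu) / sigma) ** 2)
                / (sigma * math.sqrt(2 * math.pi)))
    likelihoods = [w * gauss(x, m, s)
                   for w, m, s in zip(weights, means, stds)]
    total = sum(likelihoods)
    return [l / total for l in likelihoods]

# Two components, e.g. "pitch green" near 0.0 and "shirt red" near 5.0
r = gmm_responsibilities(0.2, [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
print(r)
```

Because every pixel is evaluated independently, this is exactly the kind of computation that maps well onto thousands of GPU threads.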

Time-consuming tasks were accelerated with NVIDIA's CUDA platform and later restructured and enhanced, significantly speeding up the whole process. The code ran around 4-11 times faster on a low-cost GPU than a highly optimized C++ version on a CPU over the same data, processing 64 frames per second in real time. An important conclusion derived from the study is that the application scales with the number of cores on the GPU.


Creating art

GPU-powered deep learning can also create amazing works of art that could even be compared to man-made masterpieces such as Munch's The Scream. The restoration of old and damaged paintings to their original state is undoubtedly also quite feasible with GPU-powered AI.

Neural algorithms enable the understanding of how humans create and perceive artistic imagery. A Titan X GPU-powered AI can be about three times faster than naive per-frame processing while providing temporally consistent output.

The following research papers can be read to learn more about this particular topic:

  • A Neural Algorithm of Artistic Style, Leon Gatys, Alexander Ecker, Matthias Bethge, Journal of Vision 2016;16(12):326. doi: 10.1167/16.12.326.
  • Artistic style transfer for videos, M Ruder, A Dosovitskiy, T  Brox (2016), Lecture Notes in Computer Science Pattern Recognition, 26-36. doi:10.1007/978-3-319-45886-1_3.


Security through video surveillance

The following research paper is discussed in this section: Real-time multi-camera video analytics system on GPU, P Guler, D Emeksiz, A Temizel, M Teke, T T Temizel (2013), Journal of Real-Time Image Processing, 11(3), 457-472, doi:10.1007/s11554-013-0337-2.

Parallel implementation of a real-time intelligent video surveillance system on GPUs is described here. The system is based on background subtraction and composed of the following:

  • Motion detection
  • Camera sabotage detection (the camera could be moved, out of focus, or covered)
  • Abandoned object detection
  • Object-tracking algorithms

As the algorithms have different characteristics, their GPU implementations achieve different speedup rates. When all the algorithms run at the same time, GPU parallelization makes the system up to 21.88 times faster than its CPU counterpart, enabling real-time analysis of a greater number of cameras.
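Background subtraction, the common basis of these algorithms, is itself embarrassingly parallel: every pixel is compared against its background model independently. A toy grayscale version (illustrative only; the paper's GPU kernels are far more elaborate):

```python
def motion_mask(frame, background, threshold=30):
    # Flag each pixel whose intensity deviates from the background
    # model by more than the threshold; each pixel is independent,
    # so a GPU can process all of them simultaneously.
    return [[1 if abs(p - b) > threshold else 0
             for p, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]

bg    = [[10, 10, 10],
         [10, 10, 10]]
frame = [[10, 90, 12],
         [10, 10, 95]]
print(motion_mask(frame, bg))  # → [[0, 1, 0], [0, 0, 1]]
```

The motion, sabotage, and abandoned-object detectors then layer their own logic on top of masks like this one.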



Agriculture and anomaly detection

The following research paper is discussed in this section: DeepAnomaly: Combining Background Subtraction and Deep Learning for Detecting Obstacles and Anomalies in an Agricultural Field, P Christiansen, L Nielsen, K Steen, R Jørgensen, H Karstoft (2016), Sensors, 16(11), 1904, doi:10.3390/s16111904.

This paper highlights the visual characteristics of a farmer's crop field where obstacles, such as people, animals, and others, occur rarely. Particularly, these obstacles are of distinct appearance compared to an empty field (which is the most usual scenario). With an algorithm called DeepAnomaly, deep learning and anomaly detection are used to exploit the homogeneous characteristics of the field to perform anomaly detection.

In a human-detector test case, DeepAnomaly was demonstrated to detect humans at longer ranges (45-90 m) than RCNN (a previously existing convolutional neural network providing real-time object detection with region proposal networks). RCNN has similar performance at short range (0-30 m). However, DeepAnomaly has far fewer model parameters and a 7.28 times faster processing time per image (25 ms versus 182 ms). Unlike most CNN-based methods, it combines high accuracy with low computation time and a low memory footprint, making it ideal for a real-time system running on an embedded GPU.

Automated drones are also being used in agriculture to water plants. Their implementation requires deep learning, and so GPUs such as the NVIDIA Jetson TX are being used to accelerate these complex computations.


Economics

The following research paper is discussed in this section: GPU Computing in Economics, E M Aldrich (2014), Handbook of Computational Economics, Vol. 3, 557-598, doi:10.1016/b978-0-444-52980-0.00010-4.

The computational benefits of GPU computing in economics are discussed in this paper. A general equilibrium model with heterogeneous beliefs about the evolution of aggregate uncertainty is highlighted.

Toronto-based Triumph Asset Management (reorganized as Amadeus Investment Partners) is using GPUs for financial analysis. They explore tens of thousands of news articles every day to predict stock market situations and to enable better investment decisions.




Summary

In this chapter, we learned about the basic concepts behind GPU computing, the history of its evolution, and its scope of use in diverse fields. We also learned about the simplicity of Python syntax and why that simplicity can be of great significance when harnessing GPUs for computational work. In the final section, we looked at the uses of GPU applications beyond science and AI, across fields such as archaeology, weather, music, sports, art, security, agriculture, and economics.

If you are a gamer and/or a computing enthusiast, you will now be able to better understand the computational aspects of GPUs and Python code. The various fields of application discussed in the later sections of this chapter will give you a general idea of the limitless areas in which you can create your own GPU applications. GPU application users can come from any field of work or study; they do not necessarily have a scientific background.

In the next chapter, we will read about the significance of the system components that center around the use of graphics cards to ensure their optimized usage.


Further reading

You can read the following research papers and articles to gain more knowledge about the topics that were discussed in this chapter:

About the Author

  • Avimanyu Bandyopadhyay

    Avimanyu Bandyopadhyay is currently pursuing a PhD degree in Bioinformatics, based on applied GPU computing in Computational Biology, at Heritage Institute of Technology, Kolkata, India. Since 2014, he has had a keen interest in GPU computing, and he used CUDA for his master's thesis. He also has experience as a systems administrator, particularly on the Linux platform.

    Avimanyu is also a scientific writer, technology communicator, and a passionate gamer. He has published technical writing on open source computing and has actively participated in NVIDIA's GPU computing conferences since 2016. A big-time Linux fan, he strongly believes in the significance of Linux and an open source approach in scientific research. Deep learning with GPUs is his new passion!

