Introduction to Parallel Programming and CUDA with Sample Code

by Daniel Bolella | September 2010

Parallel programming is becoming one of the hottest topics in software today as multi-core CPUs decrease in price and increase in power. Parallelism allows a program to execute multiple operations concurrently, using separate threads and processing units. This appeals to developers and users alike, because it can reduce run time while still producing the same results as a serial run. This article by Daniel Bolella gives an overview of parallel programming and CUDA, accompanied by sample code that illustrates the concept.

To give an example, let’s say we have an array that contains thousands of floating-point numbers, and each value needs to be run through a lengthy algorithm. Instead of running the values through the algorithm one at a time, parallelism allows many values to be processed simultaneously, reducing overall processing time while producing the same results.

There are some restrictions on using parallelism, and not every program can be run in parallel. For instance, take the same program as before, but this time, after processing each value, we want to check it against all the previously calculated values in the array before moving on to the next. In serial, we can confidently say that all previous values in the array have been processed and are available for the check. If we tried to do this in parallel, we could read incorrect data, because multiple values are calculated at the same time and some may be ready for checking while others are not. Extra checks and synchronization steps are needed to prevent these types of concurrency issues. However, the results could still prove to be worth the extra steps.

One of the major breakthroughs in parallel programming today goes beyond multi-core CPUs. Although they offer more power and potential than single-core units, another common computer component, the GPU, offers even more parallel processing power, and NVIDIA’s CUDA platform makes that power available to all developers easily and for free.

CUDA was developed by NVIDIA to provide simple access to GPGPU (General-Purpose computation on Graphics Processing Units) and parallel computing on its own GPUs. The logic behind the idea is that GPUs have numerous cores that operate in parallel to run intensive graphics operations, which gives them far more throughput than CPUs on highly parallel work. Letting developers use this power for their own projects can yield fast solutions to heavy, time-consuming programs, specifically those that run the same operation repeatedly and independently on many pieces of data.

The learning curve is not very steep for most developers. CUDA makes GPGPU approachable by extending the standard C and C++ programming languages. This allows for fast adoption by almost any programmer familiar with those languages and helps with cross-platform integration.

To get started with CUDA, you will need a recent NVIDIA GPU (GeForce 8 series and beyond; the NVIDIA website lists which GPUs are CUDA-enabled). CUDA works on Windows, Mac OS X, and certain Linux distributions. You will need to download and install the developer drivers, the CUDA toolkit, and the CUDA SDK from the NVIDIA website, in that order.

NVIDIA provides an installation guide on its website with more details about the installation process, as well as a method for checking that the installation works.
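
As a rough sketch of a quick sanity check (separate from the method in NVIDIA’s guide), the short program below asks the CUDA runtime how many devices it can see and prints each one’s name and compute capability. The function names count_cuda_devices and print_cuda_devices are my own, made up for illustration; the runtime calls cudaGetDeviceCount, cudaGetDeviceProperties, and cudaGetErrorString are part of the CUDA runtime API.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: returns the number of CUDA devices the runtime can
// see, or -1 if the runtime reports an error (for example, a missing driver).
int count_cuda_devices(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return count;
}

// Hypothetical helper: prints the name and compute capability of each device.
void print_cuda_devices(void)
{
    int count = count_cuda_devices();
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) == cudaSuccess)
            printf("Device %d: %s (compute capability %d.%d)\n",
                   dev, prop.name, prop.major, prop.minor);
    }
}

int main(void)
{
    print_cuda_devices();
    // Report failure to the shell when no usable device was found.
    return count_cuda_devices() > 0 ? 0 : 1;
}
```

If the driver and toolkit are installed correctly, compiling this with nvcc and running it should list at least one device.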

Now to write and run our first CUDA program in C! Open a text editor and copy and paste the following code:

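#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// Kernel: runs on the GPU; each thread multiplies one pair of elements.
__global__ void mult_vect(float *x, float *y, float *z, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)   // the last block may contain threads past the end of the array
        z[idx] = x[idx] * y[idx];
}

int main()
{
    float *x_h, *y_h, *z_h;   // host (CPU) arrays
    float *x_d, *y_d, *z_d;   // device (GPU) arrays
    int n = 20;
    size_t size = n * sizeof(float);

    // Allocate the host arrays, then matching arrays in device memory.
    x_h = (float *)malloc(size);
    y_h = (float *)malloc(size);
    z_h = (float *)malloc(size);
    cudaMalloc((void **)&x_d, size);
    cudaMalloc((void **)&y_d, size);
    cudaMalloc((void **)&z_d, size);

    int i;
    for (i = 0; i < n; i++) {
        x_h[i] = (float)i;
        y_h[i] = (float)i;
    }

    // Copy the inputs from the host to the device.
    cudaMemcpy(x_d, x_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(y_d, y_h, size, cudaMemcpyHostToDevice);

    int block_size = 4;
    int num_blocks = (n + block_size - 1) / block_size;
    mult_vect<<<num_blocks, block_size>>>(x_d, y_d, z_d, n);

    // Copy the result back to the host; this call waits for the kernel to finish.
    cudaMemcpy(z_h, z_d, size, cudaMemcpyDeviceToHost);

    for (i = 0; i < n; i++) {
        printf("%f ", z_h[i]);
        if (z_h[i] == (x_h[i] * y_h[i]))
            printf("successful\n");
        else
            printf("failed\n");
    }

    // Release host memory with free() and device memory with cudaFree().
    free(x_h); free(y_h); free(z_h);
    cudaFree(x_d); cudaFree(y_d); cudaFree(z_d);
    return 0;
}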


You will notice that there are a lot of similarities to the regular C language, with a few exceptions. Let’s first look at main. Here, we create arrays on the host computer as well as on the device (distinguished by the _h and _d suffixes). Host arrays are allocated with malloc(), and device arrays with cudaMalloc(). Because this is a simple example, we then fill the x_h and y_h arrays with the floats 0 through n-1. Notice that we only fill the host arrays: in this basic model, device code cannot directly use data that lives in host memory. The values must first be created in host memory and then copied into the device’s memory to be used by the device. Later, we will need to copy the processed data from the device back to host memory so that the host can use it. To copy the host arrays to the device arrays, we use cudaMemcpy() with the cudaMemcpyHostToDevice flag.
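
To isolate the copy step, here is a minimal sketch that sends a small array to the device and straight back again without running any kernel. The function name round_trip and the buffer name buf_d are hypothetical, chosen only for this illustration; cudaMalloc, cudaMemcpy, cudaFree, and cudaGetErrorString are standard CUDA runtime calls. Unlike the article’s listing, this sketch also checks the return code of each call.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: copies n floats from src (host) to the device and
// back into dst (host). Returns cudaSuccess or the first error encountered.
cudaError_t round_trip(const float *src, float *dst, int n)
{
    float *buf_d = NULL;               // device pointer: never dereference it on the host
    size_t size = n * sizeof(float);

    cudaError_t err = cudaMalloc((void **)&buf_d, size);
    if (err != cudaSuccess)
        return err;

    err = cudaMemcpy(buf_d, src, size, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(dst, buf_d, size, cudaMemcpyDeviceToHost);

    cudaFree(buf_d);
    return err;
}

int main(void)
{
    float in[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float out[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    cudaError_t err = round_trip(in, out, 4);
    if (err != cudaSuccess) {
        fprintf(stderr, "round trip failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < 4; i++)
        printf("%f ", out[i]);
    printf("\n");
    return 0;
}
```

The direction flag is what tells cudaMemcpy() which of the two pointers refers to device memory, so swapping the flags is a common source of errors.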

Next, the block size and the number of blocks to use for the GPU run must be calculated. The block size specifies how many threads to use per block, and the block size multiplied by the number of blocks gives the total number of threads launched. For more advanced parallel performance, the number of blocks and their size could be tuned to produce even faster results. However, for this simple exercise, we will just set the block size to 4 threads and calculate the number of blocks from that: (20 + 4 - 1) / 4 = 23 / 4 = 5, using integer division, which discards the remainder.
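
The rounding-up arithmetic can be tried on its own with a few array lengths. The sketch below is host-only code (it still compiles with nvcc); the helper name blocks_needed and the sample sizes are my own choices for illustration.

```cuda
#include <stdio.h>

// Hypothetical helper: the smallest number of blocks of block_size threads
// that covers n elements, i.e. n / block_size rounded up.
int blocks_needed(int n, int block_size)
{
    return (n + block_size - 1) / block_size;
}

int main(void)
{
    int block_size = 4;
    int sizes[3] = {20, 21, 3};

    for (int i = 0; i < 3; i++) {
        int n = sizes[i];
        int num_blocks = blocks_needed(n, block_size);
        int launched = num_blocks * block_size;   // total threads in the grid
        printf("n=%d: %d blocks, %d threads launched, %d idle\n",
               n, num_blocks, launched, launched - n);
    }
    return 0;
}
```

With a block size of 4, 20 elements need exactly 5 blocks with no idle threads, while 21 elements need 6 blocks and leave 3 threads with nothing to do. Those spare threads are the reason the kernel checks its index against n before touching the arrays.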

Now we call our CUDA function to process the array using the GPU. Let’s examine what the function looks like and how it works.

You’ll notice that our CUDA function is declared with "__global__ void". This states that it is a CUDA function that is called by the host and runs on the GPU. (Note: A function with a "__device__" prefix is called by the device to be run on the device. I will expand upon this later on.) Our function takes in the three float arrays and their length, n. The function first calculates the index of the current thread by multiplying the block index by the block dimension and then adding the thread index. If the calculated index is less than n, it multiplies the corresponding elements of x and y and stores the product in z. The kernel is finished once all the threads have completed, and the work continues in main on the host.
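
To see how the index is built, here is a small sketch in which every thread writes down its own block index and thread index, so the mapping can be printed on the host. The names record_ids and collect_ids are hypothetical; blockIdx, blockDim, and threadIdx are CUDA’s built-in variables.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Each thread records which block and which thread within that block
// handled its global index.
__global__ void record_ids(int *block_of, int *thread_of, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        block_of[idx] = blockIdx.x;
        thread_of[idx] = threadIdx.x;
    }
}

// Hypothetical helper: launches record_ids over n elements and copies the
// results into the two host arrays. Returns 0 on success, -1 on any error.
int collect_ids(int *block_of_h, int *thread_of_h, int n, int block_size)
{
    int *block_of_d = NULL, *thread_of_d = NULL;
    size_t size = n * sizeof(int);
    int status = -1;

    if (cudaMalloc((void **)&block_of_d, size) == cudaSuccess &&
        cudaMalloc((void **)&thread_of_d, size) == cudaSuccess) {
        int num_blocks = (n + block_size - 1) / block_size;
        record_ids<<<num_blocks, block_size>>>(block_of_d, thread_of_d, n);

        // cudaGetLastError() reports launch problems; the blocking copies
        // wait for the kernel to finish before reading the results.
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(block_of_h, block_of_d, size, cudaMemcpyDeviceToHost) == cudaSuccess &&
            cudaMemcpy(thread_of_h, thread_of_d, size, cudaMemcpyDeviceToHost) == cudaSuccess)
            status = 0;
    }
    cudaFree(block_of_d);    // cudaFree on a null pointer does nothing
    cudaFree(thread_of_d);
    return status;
}

int main(void)
{
    enum { N = 10, BLOCK = 4 };
    int block_of[N], thread_of[N];

    if (collect_ids(block_of, thread_of, N, BLOCK) != 0) {
        fprintf(stderr, "collect_ids failed\n");
        return 1;
    }
    for (int i = 0; i < N; i++)
        printf("index %d = block %d * %d + thread %d\n",
               i, block_of[i], BLOCK, thread_of[i]);
    return 0;
}
```

With a block size of 4, index 6 is handled by thread 2 of block 1, because 1 * 4 + 2 = 6.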

Back in main, we copy the new values from the device to the host using cudaMemcpy() again, this time with the cudaMemcpyDeviceToHost flag. Then, we run a loop to display the results from the device and check whether the values are correct. Lastly, we free all of the variables using free() for host variables and cudaFree() for device variables.

Now to run the code! Save the file with a .cu extension and open the command prompt/terminal. Go to the directory where you saved the file, and make sure the nvcc compiler from the CUDA toolkit is on your PATH. Compile the code by typing "nvcc <filename>.cu -o <filename>", and after it has compiled, type "<filename>.exe" (Windows) or "./<filename>" (Linux).

If the program runs and every value is reported as successful, then that’s it! You are now set up for GPGPU and parallel programming on your computer!

Now that we’ve gone over the basics of how CUDA looks and works, the last thing to go over is CUDA functions. Functions in CUDA are similar to functions in C or C++, but carry qualifiers that identify what type of CUDA function they are. As mentioned earlier, they are marked with the "__global__" and "__device__" prefixes. Both kinds run on the device. The difference between the two is that "__global__" functions are called by the host (usually from main) and "__device__" functions are called by other functions running on the device, whether those are "__global__" or "__device__" functions. So in our example, we could create a "__device__" function that multiplies x and y and returns the value, and then in our "__global__" function set z equal to the result.
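
Here is one way that variation might look, as a sketch rather than a definitive version. The "__device__" function name multiply and the host wrapper mult_on_device are hypothetical names of my own; mult_vect is the kernel from the example, changed only to call the helper.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Device helper: callable only from code that is already running on the GPU.
__device__ float multiply(float a, float b)
{
    return a * b;
}

// Kernel: launched from the host; delegates the arithmetic to multiply().
__global__ void mult_vect(float *x, float *y, float *z, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        z[idx] = multiply(x[idx], y[idx]);
}

// Hypothetical host wrapper: computes z_h[i] = x_h[i] * y_h[i] on the GPU.
// Returns 0 on success and -1 if any CUDA call fails.
int mult_on_device(const float *x_h, const float *y_h, float *z_h, int n)
{
    float *x_d = NULL, *y_d = NULL, *z_d = NULL;
    size_t size = n * sizeof(float);
    int status = -1;

    if (cudaMalloc((void **)&x_d, size) == cudaSuccess &&
        cudaMalloc((void **)&y_d, size) == cudaSuccess &&
        cudaMalloc((void **)&z_d, size) == cudaSuccess &&
        cudaMemcpy(x_d, x_h, size, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(y_d, y_h, size, cudaMemcpyHostToDevice) == cudaSuccess) {
        int block_size = 4;
        int num_blocks = (n + block_size - 1) / block_size;
        mult_vect<<<num_blocks, block_size>>>(x_d, y_d, z_d, n);

        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(z_h, z_d, size, cudaMemcpyDeviceToHost) == cudaSuccess)
            status = 0;
    }
    cudaFree(x_d);   // cudaFree on a null pointer does nothing
    cudaFree(y_d);
    cudaFree(z_d);
    return status;
}

int main(void)
{
    float x[5] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f};
    float y[5] = {0.0f, 1.0f, 2.0f, 3.0f, 4.0f};
    float z[5];

    if (mult_on_device(x, y, z, 5) != 0) {
        fprintf(stderr, "mult_on_device failed\n");
        return 1;
    }
    for (int i = 0; i < 5; i++)
        printf("%f ", z[i]);
    printf("\n");
    return 0;
}
```

Calling multiply() directly from main would not compile, because a "__device__" function can only be called from device code. For a one-line operation the helper adds little, but the same pattern lets a kernel share larger pieces of logic.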


There are many more things to discuss about CUDA, but they are beyond the scope of this article. My hope for this introduction is that you learned what CUDA is and how it can be used, and that it will come in handy for any intensive programs you need to run in the future. For more source code, examples, and documentation, I recommend visiting the NVIDIA website, where there is plenty of support and an elaborate showcase of what others have done with the power of CUDA. Enjoy!

About the Author

Daniel Bolella is currently pursuing his Bachelor’s in Computer Science at Stevens Institute of Technology. He has been programming since his early teens and has explored many different areas of application, including systems programming, financial applications, web development, reverse engineering, and game development. While attending the National Youth Leadership Conference on Technology 2006 in San Jose, he led a team to first place for Best Social Impact in the Future Solutions Competition. Since attending college, he has had four major internships, including one at a Fortune 500 company and another at a Fortune 100 company, and has made numerous connections in the computer industry, as well as in other industries. He is always more than willing to take on new challenges and learn new things, as well as help teach others to the best of his ability. His blog on programming topics and personal projects can be found at

A bit Buggy by
I was compiling your code on Linux and found some small bugs. No memory is allocated for the variables x_d, y_d and z_d (cudaMalloc(&x_d, size); cudaMalloc(&y_d, size); cudaMalloc(&z_d, size);). In the last for loop there is if(z_h[i] == (x[i] * y[i])), which should be if(z_h[i] == (x_h[i] * y_h[i])) (there are no variables x and y). Greetings, Me
