OpenCL Programming by Example

Chapter 1. Hello OpenCL

Parallel Computing has been extensively researched over the past few decades and had been the key research interest at many universities. Parallel Computing uses multiple processors or computers working together on a common algorithm or task. Due to the constraints in the available memory, performance of a single computing unit, and also the need to complete a task quickly, various parallel computing frameworks have been defined. All computers are parallel these days, even your handheld mobiles are multicore platforms and each of these parallel computers uses a parallel computing framework of their choice. Let's define Parallel Computing.

The Wikipedia definition says that, Parallel Computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently (in parallel).

There are many Parallel Computing programming standards or API specifications, such as OpenMP, OpenMPI, Pthreads, and so on. This book is all about OpenCL Parallel Programming. In this chapter, we will start with a discussion on different types of parallel programming. We will first introduce you to OpenCL with different OpenCL components. We will also take a look at the various hardware and software vendors of OpenCL and their OpenCL installation steps. Finally, at the end of the chapter we will see an OpenCL program example SAXPY in detail and its implementation.

Advances in computer architecture

All over the 20th century computer architectures have advanced by multiple folds. The trend is continuing in the 21st century and will remain for a long time to come. Some of these trends in architecture follow Moore's Law. "Moore's law is the observation that, over the history of computing hardware, the number of transistors on integrated circuits doubles approximately every two years". Many devices in the computer industry are linked to Moore's law, whether they are DSPs, memory devices, or digital cameras. All the hardware advances would be of no use if there weren't any software advances. Algorithms and software applications grow in complexity, as more and more user interaction comes into play. An algorithm can be highly sequential or it may be parallelized, by using any parallel computing framework. Amdahl's Law is used to predict the speedup for an algorithm, which can be obtained given n threads. This speedup is dependent on the value of the amount of strictly serial or non-parallelizable code (B). The time T(n) an algorithm takes to finish when being executed on n thread(s) of execution corresponds to:

T(n) = T(1) (B + (1-B)/n)

Therefore the theoretical speedup which can be obtained for a given algorithm is given by :

Speedup(n) = 1/(B + (1-B)/n)

Amdahl's Law has a limitation, that it does not fully exploit the computing power that becomes available as the number of processing core increase.

Gustafson's Law takes into account the scaling of the platform by adding more processing elements in the platform. This law assumes that the total amount of work that can be done in parallel, varies linearly with the increase in number of processing elements. Let an algorithm be decomposed into (a+b). The variable a is the serial execution time and variable b is the parallel execution time. Then the corresponding speedup for P parallel elements is given by:

(a + P*b)

Speedup = (a + P*b) / (a + b)

Now defining α as a/(a+b), the sequential execution component, as follows, gives the speedup for P processing elements:

Speedup(P) = P – α *(P - 1)

Given a problem which can be solved using OpenCL, the same problem can also be solved on a different hardware with different capabilities. Gustafson's law suggests that with more number of computing units, the data set should also increase that is, "fixed work per processor". Whereas Amdahl's law suggests the speedup which can be obtained for the existing data set if more computing units are added, that is, "Fixed work for all processors". Let's take the following example:

Let the serial component and parallel component of execution be of one unit each.

In Amdahl's Law the strictly serial component of code is B (equals 0.5). For two processors, the speedup T(2) is given by:

T(2) = 1 / (0.5 + (1 – 0.5) / 2) = 1.33

Similarly for four and eight processors, the speedup is given by:

T(4) = 1.6 and T(8) = 1.77

Adding more processors, for example when n tends to infinity, the speedup obtained at max is only 2. On the other hand in Gustafson's law, Alpha = 1(1+1) = 0.5 (which is also the serial component of code). The speedup for two processors is given by:

Speedup(2) = 2 – 0.5(2 - 1) = 1.5

Similarly for four and eight processors, the speedup is given by:

Speedup(4) = 2.5 and Speedup(8) = 4.5

The following figure shows the work load scaling factor of Gustafson's law, when compared to Amdahl's law with a constant workload:

Comparison of Amdahl's and Gustafson's Law

OpenCL is all about parallel programming, and Gustafson's law very well fits into this book as we will be dealing with OpenCL for data parallel applications. Workloads which are data parallel in nature can easily increase the data set and take advantage of the scalable platforms by adding more compute units. For example, more pixels can be computed as more compute units are added.

Different parallel programming techniques

There are several different forms of parallel computing such as bit-level, instruction level, data, and task parallelism. This book will largely focus on data and task parallelism using heterogeneous devices. We just now coined a term, heterogeneous devices. How do we tackle complex tasks "in parallel" using different types of computer architecture? Why do we need OpenCL when there are many (already defined) open standards for Parallel Computing?

To answer this question, let us discuss the pros and cons of different Parallel computing Framework.

OpenMP

OpenMP is an API that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran. It is prevalent only on a multi-core computer platform with a shared memory subsystem.

A basic OpenMP example implementation of the OpenMP Parallel directive is as follows:

#pragma omp parallel
{
  body;
}

When you build the preceding code using the OpenMP shared library, libgomp would expand to something similar to the following code:

void subfunction (void *data)
{
    use data;
    body;
}
     
setup data;
GOMP_parallel_start (subfunction, &data, num_threads);
subfunction (&data);
GOMP_parallel_end ();
void GOMP_parallel_start (void (*fn)(void *), void *data, unsigned num_threads)

The OpenMP directives make things easy for the developer to modify the existing code to exploit the multicore architecture. OpenMP, though being a great parallel programming tool, does not support parallel execution on heterogeneous devices, and the use of a multicore architecture with shared memory subsystem does not make it cost effective.

MPI

Message Passing Interface (MPI) has an advantage over OpenMP, that it can run on either the shared or distributed memory architecture. Distributed memory computers are less expensive than large shared memory computers. But it has its own drawback with inherent programming and debugging challenges. One major disadvantage of MPI parallel framework is that the performance is limited by the communication network between the nodes.

Supercomputers have a massive number of processors which are interconnected using a high speed network connection or are in computer clusters, where computer processors are in close proximity to each other. In clusters, there is an expensive and dedicated data bus for data transfers across the computers. MPI is extensively used in most of these compute monsters called supercomputers.

OpenACC

The OpenACC Application Program Interface (API) describes a collection of compiler directives to specify loops and regions of code in standard C, C++, and Fortran to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs, and accelerators. OpenACC is similar to OpenMP in terms of program annotation, but unlike OpenMP which can only be accelerated on CPUs, OpenACC programs can be accelerated on a GPU or on other accelerators also. OpenACC aims to overcome the drawbacks of OpenMP by making parallel programming possible across heterogeneous devices. OpenACC standard describes directives and APIs to accelerate the applications. The ease of programming and the ability to scale the existing codes to use the heterogeneous processor, warrantees a great future for OpenACC programming.

CUDA

Compute Unified Device Architecture (CUDA) is a parallel computing architecture developed by NVIDIA for graphics processing and GPU (General Purpose GPU) programming. There is a fairly good developer community following for the CUDA software framework. Unlike OpenCL, which is supported on GPUs by many vendors and even on many other devices such as IBM's Cell B.E. processor or TI's DSP processor and so on, CUDA is supported only for NVIDIA GPUs. Due to this lack of generalization, and focus on a very specific hardware platform from a single vendor, OpenCL is gaining traction.

CUDA or OpenCL?

CUDA is more proprietary and vendor specific but has its own advantages. It is easier to learn and start writing code in CUDA than in OpenCL, due to its simplicity. Optimization of CUDA is more deterministic across a platform, since less number of platforms are supported from a single vendor only. It has simplified few programming constructs and mechanisms. So for a quick start and if you are sure that you can stick to one device (GPU) from a single vendor that is NVIDIA, CUDA can be a good choice.

OpenCL on the other hand is supported for many hardware from several vendors and those hardware vary extensively even in their basic architecture, which created the requirement of understanding a little complicated concepts before starting OpenCL programming. Also, due to the support of a huge range of hardware, although an OpenCL program is portable, it may lose optimization when ported from one platform to another.

The kernel development where most of the effort goes, is practically identical between the two languages. So, one should not worry about which one to choose. Choose the language which is convenient. But remember your OpenCL application will be vendor agnostic. This book aims at attracting more developers to OpenCL.

There are many libraries which use OpenCL programming for acceleration. Some of them are MAGMA, clAMDBLAS, clAMDFFT, BOLT C++ Template library, and JACKET which accelerate MATLAB on GPUs. Besides this, there are C++ and Java bindings available for OpenCL also.

Once you've figured out how to write your important "kernels" it's trivial to port to either OpenCL or CUDA. A kernel is a computation code which is executed by an array of threads. CUDA also has a vast set of CUDA accelerated libraries, that is, CUBLAS, CUFFT, CUSPARSE, Thrust and so on. But it may not take a long time to port these libraries to OpenCL.

Renderscripts

Renderscripts is also an API specification which is targeted for 3D rendering and general purpose compute operations in an Android platform. Android apps can accelerate the performance by using these APIs. It is also a cross-platform solution. When an app is run, the scripts are compiled into a machine code of the device. This device can be a CPU, a GPU, or a DSP. The choice of which device to run it on is made at runtime. If a platform does not have a GPU, the code may fall back to the CPU. Only Android supports this API specification as of now. The execution model in Renderscripts is similar to that of OpenCL.

Hybrid parallel computing model

Parallel programming models have their own advantages and disadvantages. With the advent of many different types of computer architectures, there is a need to use multiple programming models to achieve high performance. For example, one may want to use MPI as the message passing framework, and then at each node level one might want to use, OpenCL, CUDA, OpenMP, or OpenACC.

Besides all the above programming models many compilers such as Intel ICC, GCC, and Open64 provide auto parallelization options, which makes the programmers job easy and exploit the underlying hardware architecture without the need of knowing any parallel computing framework. Compilers are known to be good at providing instruction-level parallelism. But tackling data level or task level auto parallelism has its own limitations and complexities.

Introduction to OpenCL

OpenCL standard was first introduced by Apple, and later on became part of the open standards organization "Khronos Group". It is a non-profit industry consortium, creating open standards for the authoring, and acceleration of parallel computing, graphics, dynamic media, computer vision and sensor processing on a wide variety of platforms and devices.

The goal of OpenCL is to make certain types of parallel programming easier, and to provide vendor agnostic hardware-accelerated parallel execution of code. OpenCL (Open Computing Language) is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems. It provides a uniform programming environment for software developers to write efficient, portable code for high-performance compute servers, desktop computer systems, and handheld devices using a diverse mix of multi-core CPUs, GPUs, and DSPs.

OpenCL gives developers a common set of easy-to-use tools to take advantage of any device with an OpenCL driver (processors, graphics cards, and so on) for the processing of parallel code. By creating an efficient, close-to-the-metal programming interface, OpenCL will form the foundation layer of a parallel computing ecosystem of platform-independent tools, middleware, and applications.

We mentioned vendor agnostic, yes that is what OpenCL is about. The different vendors here can be AMD, Intel, NVIDIA, ARM, TI, and so on. The following diagram shows the different vendors and hardware architectures which use the OpenCL specification to leverage the hardware capabilities:

The heterogeneous system

The OpenCL framework defines a language to write "kernels". These kernels are functions which are capable of running on different compute devices. OpenCL defines an extended C language for writing compute kernels, and a set of APIs for creating and managing these kernels. The compute kernels are compiled with a runtime compiler, which compiles them on-the-fly during host application execution for the targeted device. This enables the host application to take advantage of all the compute devices in the system with a single set of portable compute kernels.

Based on your interest and hardware availability, you might want to do OpenCL programming with a "host and device" combination of "CPU and CPU" or "CPU and GPU". Both have their own programming strategy. In CPUs you can run very large kernels as the CPU architecture supports out-of-order instruction level parallelism and have large caches. For the GPU you will be better off writing small kernels for better performance. Performance optimization is a huge topic in itself. We will try to discuss this with a case study in Chapter 8, Basic Optimization Techniques with Case Study

Hardware and software vendors

There are various hardware vendors who support OpenCL. Every OpenCL vendor provides OpenCL runtime libraries. These runtimes are capable of running only on their specific hardware architectures. Not only across different vendors, but within a vendor there may be different types of architectures which might need a different approach towards OpenCL programming. Now let's discuss the various hardware vendors who provide an implementation of OpenCL, to exploit their underlying hardware.

Advanced Micro Devices, Inc. (AMD)

With the launch of AMD A Series APU, one of industry's first Accelerated Processing Unit (APU), AMD is leading the efforts of integrating both the x86_64 CPU and GPU dies in one chip. It has four cores of CPU processing power, and also a four or five graphics SIMD engine, depending on the silicon part which you wish to buy. The following figure shows the block diagram of AMD APU architecture:

An AMD GPU consist of a number of Compute Engines (CU) and each CU has 16 ALUs. Further, each ALU is a VLIW4 SIMD processor and it could execute a bundle of four or five independent instructions. Each CU could be issued a group of 64 work items which form the work group (wavefront). AMD Radeon ^™ HD 6XXX graphics processors uses this design. The following figure shows the HD 6XXX series Compute unit, which has 16 SIMD engines, each of which has four processing elements:

Starting with the AMD Radeon HD 7XXX series of graphics processors from AMD, there were significant architectural changes. AMD introduced the new Graphics Core Next (GCN) architecture. The following figure shows an GCN compute unit which has 4 SIMD engines and each engine is 16 lanes wide:

A group of these Compute Units forms an AMD HD 7xxx Graphics Processor. In GCN, each CU includes four separate SIMD units for vector processing. Each of these SIMD units simultaneously execute a single operation across 16 work items, but each can be working on a separate wavefront.

Apart from the APUs, AMD also provides discrete graphics cards. The latest family of graphics card, HD 7XXX, and beyond uses the GCN architecture. We will discuss one of the discrete GPU architectures in the following chapter, where we will discuss the OpenCL Platform model. AMD also provides the OpenCL runtimes for their CPU devices.

NVIDIA^®

One of NVIDIA GPU architectures is codenamed "Kepler". GeForce^® GTX 680 is one Kepler architectural silicon part. Each Kepler GPU consists of different configurations of Graphics Processing Clusters (GPC) and streaming multiprocessors. The GTX 680 consists of four GPCs and eight SMXs as shown in the following figure:

NVIDIA Kepler architecture—GTX 680, © NVIDIA^®

Kepler architecture is part of the GTX 6XX and GTX 7XX family of NVIDIA discrete cards. Prior to Kepler, NVIDIA had Fermi architecture which was part of the GTX 5XX family of discrete and mobile graphic processing units.

Intel^®

Intel's OpenCL implementation is supported in the Sandy Bridge and Ivy Bridge processor families. Sandy Bridge family architecture is also synonymous with the AMD's APU. These processor architectures also integrated a GPU into the same silicon as the CPU by Intel. Intel changed the design of the L3 cache, and allowed the graphic cores to get access to the L3, which is also called as the last level cache. It is because of this L3 sharing that the graphics performance is good in Intel. Each of the CPUs including the graphics execution unit is connected via Ring Bus. Also each execution unit is a true parallel scalar processor. Sandy Bridge provides the graphics engine HD 2000, with six Execution Units (EU), and HD 3000 (12 EU), and Ivy Bridge provides HD 2500(six EU) and HD 4000 (16 EU). The following figure shows the Sandy bridge architecture with a ring bus, which acts as an interconnect between the cores and the HD graphics:

Intel Sandy Bridge architecture—© Intel^®

ARM Mali^™ GPUs

ARM also provides GPUs by the name of Mali Graphics processors. The Mali T6XX series of processors come with two, four, or eight graphics cores. These graphic engines deliver graphics compute capability to entry level smartphones, tablets, and Smart TVs. The below diagram shows the Mali T628 graphics processor.

ARM Mali—T628 graphics processor, © ARM

Mali T628 has eight shader cores or graphic cores. These cores also support Renderscripts APIs besides supporting OpenCL.

Besides the four key competitors, companies such as TI (DSP), Altera (FPGA), and Oracle are providing OpenCL implementations for their respective hardware. We suggest you to get hold of the benchmark performance numbers of the different processor architectures we discussed, and try to compare the performance numbers of each of them. This is an important first step towards comparing different architectures, and in the future you might want to select a particular OpenCL platform based on your application workload.

OpenCL components

Before delving into the programming aspects in OpenCL, we will take a look at the different components in an OpenCL framework. The first thing is the OpenCL specification. The OpenCL specification describes the OpenCL programming architecture details, and a set of APIs to perform specific tasks, which are all required by an application developer. This specification is provided by the Khronos OpenCL consortium. Besides this, Khronos also provides OpenCL header files. They are cl.h, cl_gl.h, cl_platform.h, and so on.

An application programmer uses these header files to develop his application and the host compiler links with the OpenCL.lib library on Windows. This library contains the entry points for the runtime DLL OpenCL.dll. On Linux the application program is linked dynamically with the libOpenCL.so shared library. The source code for the OpenCL.lib file is also provided by Khronos. The different OpenCL vendors shall redistribute this OpenCL.lib file and package it along with their OpenCL development SDK. Now the application is ready to be deployed on different platforms.

The different components in OpenCL are shown in the following figure:

Different components in OpenCL

On Windows, at runtime the application first loads the OpenCL.dll dynamic link library which in turn, based on the platform selected, loads the appropriate OpenCL runtime driver by reading the Windows registry entry for the selected platform (either of amdocl.dll or any other vendor OpenCL runtimes). On Linux, at runtime the application loads the libOpenCL.so shared library, which in turn reads the file /etc/OpenCL/vendors/*.icd and loads the library for the selected platform. There may be multiple runtime drivers installed, but it is the responsibility of the application developers to choose one of them, or if there are multiple devices in the platforms, he may want to choose all the available platforms. During runtime calls to OpenCL, functions queue parallel tasks on OpenCL capable devices. We will discuss more on OpenCL Runtimes in Chapter 5, OpenCL Program and Kernel Objects.

An example of OpenCL program

In this section we will discuss all the necessary steps to run an OpenCL application.

Basic software requirements

A person involved in OpenCL programming should be very proficient in C programming, and having prior experience in any parallel programming tool will be an added advantage. He or she should be able to break a large problem and find out the data and task parallel regions of the code which he or she is trying to accelerate using OpenCL. An OpenCL programmer should know the underlying architecture for which he/she is trying to program. If you are porting an existing parallel code into OpenCL, then you just need to start learning the OpenCL programming architecture.

Besides this a programmer should also have the basic system software details, such as compiling the code and linking it to an appropriate 32 bit or 64 bit library. He should also have knowledge of setting the system path on Windows to the correct DLLs or set the LD_LIBRARY_PATH environment variable in Linux to the correct shared libraries.

The common system requirements for Windows and Linux operating systems are as follows:

Windows

You should have administrative privileges on the system
Microsoft Windows XP, Vista, or 7
Microsoft Visual Studio 2005, 2008, or 2010
Display Drivers for AMD and NVIDIA GPUs. For NVIDIA GPUs you will need display drivers R295 or R300 and above

Linux

You should have root permissions to install the SDK
With the vast number of flavors of Linux, practically any supported version which has the corresponding graphic device driver installed for the GPU

The GCC compiler tool chain

Installing and setting up an OpenCL compliant computer

To install OpenCL you need to download an implementation of OpenCL. We discussed about the various hardware and software vendors in a previous section. The major graphic vendors, NVIDIA and AMD have both released implementations of OpenCL for their GPUs. Similarly AMD and Intel provide a CPU-only runtime for OpenCL. OpenCL implementations are available in so-called Software Development Kits (SDK), and often include some useful tools such as debuggers and profilers. The next step is to download and install the SDK for the GPU you have on your computer. Note that not all graphic cards are supported. A list of which graphics cards are supported can be found in the respective vendor specific websites. Also you can take a look at the Khronos OpenCL conformance products list. If you don't have a graphics card, don't worry, you can use your existing processor to run OpenCL samples on CPU as a device.

If you are still confused about which device to choose, then take a look at the list of supported devices provided with each release of an OpenCL SDK from different vendors.

Installation steps

For NVIDIA installation steps, we suggest you to take a look at the latest installation steps for the CUDA software. First install the GPU computing SDK provided for the OS. The following link provides the installation steps for NVIDIA platforms:
http://developer.download.nvidia.com/compute/cuda/3_2_prod/sdk/docs/OpenCL_Release_Notes.txt
For AMD Accelerated Parallel Processing (APP) SDK installation take a look at the AMD APP SDK latest version installation guide. The AMD APP SDK comes with a huge set of sample programs which can be used for running. The following link is where you will find the latest APP SDK installation notes:
http://developer.amd.com/download/AMD_APP_SDK_Installation_Notes.pdf
For INTEL SDK for OpenCL applications 2013, use the steps provided in the following link:
http://software.intel.com/en-us/articles/intel-sdk-for-opencl-applications-2013-release-notes

Note these links are subject to change over a period of time.

AMD's OpenCL implementation is OpenCL 1.2 conformant. Also download the latest AMD APP SDK version 2.8 or above.

For NVIDIA GPU computing, make sure you have a CUDA enabled GPU. Download the latest CUDA release 4.2 or above, and the GPU computing SDK release 4.2 or above.

For Intel, download the Intel SDK for OpenCL Applications 2013.

We will briefly discuss the installation steps. The installation steps may vary from vendor to vendor. Hence we discuss only AMD's and NVIDIA's installation steps. Note that NVIDIA's CUDA only supports GPU as the device. So we suggest that if you have a non NVIDIA GPU then it would be better that you install AMD APP SDK, as it supports both the AMD GPUs and CPUs as the device. One can have multiple vendor SDKs also installed. This is possible as the OpenCL specification allows runtime selection of the OpenCL platform. This is referred to as the ICD (Installable Client Driver) dispatch mechanism. We will discuss more about this in a later chapter.

Installing OpenCL on a Linux system with an AMD graphics card

Make sure you have root privileges and remove all previous installations of APP SDK.
Untar the downloaded SDK.
Run the Install Script Install-AMD-APP.sh.
This will install the developer binary, and samples in folder /opt/AMPAPP/.
Make sure the variables AMDAPPSDKROOT and LD_LIBRARY_PATH are set to the locations where you have installed the APP SDK.

For latest details you can refer to the Installation Notes provided with the APP SDK. Linux distributions such as Ubuntu, provide an OpenCL distribution package for vendors such as AMD and NVIDIA. You can use the following command to install the OpenCL runtimes for AMD:

sudo apt-get install amd-opencl-dev

For NVIDIA you can use the following command:

sudo apt-get install nvidia-opencl-dev

Note that amd-opencl-dev installs both the CPU and GPU OpenCL implementations.

Installing OpenCL on a Linux system with an NVIDIA graphics card

Delete any previous installations of CUDA.
Make sure you have the CUDA supported version of Linux, and run lspci to check the video adapter which the system uses. Download and install the corresponding display driver.
Install the CUDA toolkit which contains the tools needed to compile and build a CUDA application.
Install the GPU computing SDK. This includes sample projects and other resources for constructing CUDA programs.

You system is now ready to compile and run any OpenCL code.

Installing OpenCL on a Windows system with an AMD graphics card

Download the AMD APP SDK v2.7 and start installation.
Follow the onscreen prompts and perform an express installation.
This installs the AMD APP samples, runtime, and tools such as the APP Profiler and APP Kernel Analyser.
The express installation sets up the environment variables AMDAPPSDKROOT and AMDAPPSDKSAMPLESROOT.
If you select custom install then you will need to set the environment variables to the appropriate path.

Go to the samples directory and build the OpenCL samples, using the Microsoft Visual Studio.

Installing OpenCL on a Windows system with an NVIDIA graphics card

Uninstall any previous versions of the CUDA installation.
CUDA 4.2 or above release toolkit requires version R295, R300, or newer of the Windows Vista or Windows XP NVIDIA display driver.
Make sure you install the display driver and then proceed to the installation.
Install the Version 4.2 release of the NVIDIA CUDA toolkit cudatoolkit_4.2_Win_[32|64].exe.
Install the Version 4.2 release of the NVIDIA GPU computing SDK by running gpucomputingsdk_4.2_Win_[32|64].exe.

Verify the installation by compiling and running some sample codes.

Apple OSX

Apple also provides an OpenCL implementation. You will need XCode developer tool to be installed. Xcode is a complete tool set for building OSX and iOS applications. For more information on building OpenCL application on OSX visit at the following link:

https://developer.apple.com/library/mac/documentation/Performance/Conceptual/OpenCL_MacProgGuide/Introduction/Introduction.html

Multiple installations

As we have stated earlier, there can be multiple installations of OpenCL in a system. This is possible in OpenCL standard, because all OpenCL applications are linked using a common library called the OpenCL ICD library. Each OpenCL vendor, ships this library and the corresponding OpenCL.dll or libOpenCL.so library in its SDK. This library contains the mechanism to select the appropriate vendor-specific runtimes during runtime. The application developer makes this selection. Let's explain this with an example installation of an AMD and Intel OpenCL SDK. In the following screenshot of the Windows Registry Editor you can see two runtime DLLs. It is one of these libraries which is loaded by the OpenCL.dll library, based on the application developers selection. The following shows the Regedit entry with AMD and Intel OpenCL installations:

Registry Editor screenshot, showing multiple installations

During runtime, the OpenCL.dll library will read the registry details specific to HKEY_LOCAL_MACHINE\SOFTWARE\Khronos (or libOpenCL.so in Linux, will read the value of the vendor-specific library in the ICD file in folder /etc/OpenCL/vendors/*.icd), loads the appropriate library, and assigns the function pointers to the loaded library. An application developer can consider OpenCL.dll or libOpenCL.so as the wrapper around different OpenCL vendor libraries. This makes the application developers life easy and he can link it with OpenCL.lib or libOpenCL.so during link time, and distribute it with his application. This allows the application developer to ship his code for different OpenCL vendors/implementations easily.

Implement the SAXPY routine in OpenCL

SAXPY can be called the "Hello World" of OpenCL. In the simplest terms, the first OpenCL sample shall compute A = alpha*B + C, where alpha is a constant and A, B, and C are vectors of an arbitrary size n. In linear algebra terms, this operation is called SAXPY (Single precision real Alpha X plus Y). You might have understood by now, that each multiplication and addition operation is independent of the other. So this is a data parallel problem.

A simple C program would look something like the following code:

void saxpy(int n, float a, float *b, float *c)
{
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];
}

OpenCL code

An OpenCL code consists of the host code and the device code. The OpenCL kernel code is highlighted in the following code. This is the code which is compiled at run time and runs on the selected device. The following sample code computes A = alpha*B + C, where A, B, and C are vectors (arrays) of size given by the VECTOR_SIZE variable:

#include <stdio.h>
#include <stdlib.h>
#ifdef __APPLE__
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif
#define VECTOR_SIZE 1024

//OpenCL kernel which is run for every work item created.
const char *saxpy_kernel =
"__kernel                                   \n"
"void saxpy_kernel(float alpha,     \n"
"                  __global float *A,       \n"
"                  __global float *B,       \n"
"                  __global float *C)       \n"
"{                                          \n"
"    //Get the index of the work-item       \n"
"    int index = get_global_id(0);          \n"
"    C[index] = alpha* A[index] + B[index]; \n"
"}                                          \n";

int main(void) {
  int i;
  // Allocate space for vectors A, B and C
  float alpha = 2.0;
  float *A = (float*)malloc(sizeof(float)*VECTOR_SIZE);
  float *B = (float*)malloc(sizeof(float)*VECTOR_SIZE);
  float *C = (float*)malloc(sizeof(float)*VECTOR_SIZE);
  for(i = 0; i < VECTOR_SIZE; i++)
  {
    A[i] = i;
    B[i] = VECTOR_SIZE - i;
    C[i] = 0;
  }

  // Get platform and device information
  cl_platform_id * platforms = NULL;
  cl_uint     num_platforms;
  //Set up the Platform
  cl_int clStatus = clGetPlatformIDs(0, NULL, &num_platforms);
  platforms = (cl_platform_id *)
  malloc(sizeof(cl_platform_id)*num_platforms);
  clStatus = clGetPlatformIDs(num_platforms, platforms, NULL);

  //Get the devices list and choose the device you want to run on
  cl_device_id     *device_list = NULL;
  cl_uint           num_devices;

  clStatus = clGetDeviceIDs( platforms[0], CL_DEVICE_TYPE_GPU, 0,NULL, &num_devices);
  device_list = (cl_device_id *) 
  malloc(sizeof(cl_device_id)*num_devices);
  clStatus = clGetDeviceIDs( platforms[0],CL_DEVICE_TYPE_GPU, num_devices, device_list, NULL);

  // Create one OpenCL context for each device in the platform
  cl_context context;
  context = clCreateContext( NULL, num_devices, device_list, NULL, NULL, &clStatus);

  // Create a command queue
  cl_command_queue command_queue = clCreateCommandQueue(context, device_list[0], 0, &clStatus);

  // Create memory buffers on the device for each vector
  cl_mem A_clmem = clCreateBuffer(context, CL_MEM_READ_ONLY,VECTOR_SIZE * sizeof(float), NULL, &clStatus);
  cl_mem B_clmem = clCreateBuffer(context, CL_MEM_READ_ONLY,VECTOR_SIZE * sizeof(float), NULL, &clStatus);
  cl_mem C_clmem = clCreateBuffer(context, CL_MEM_WRITE_ONLY,VECTOR_SIZE * sizeof(float), NULL, &clStatus);

  // Copy the Buffer A and B to the device
  clStatus = clEnqueueWriteBuffer(command_queue, A_clmem, CL_TRUE, 0, VECTOR_SIZE * sizeof(float), A, 0, NULL, NULL);
  clStatus = clEnqueueWriteBuffer(command_queue, B_clmem, CL_TRUE, 0, VECTOR_SIZE * sizeof(float), B, 0, NULL, NULL);

  // Create a program from the kernel source
  cl_program program = clCreateProgramWithSource(context, 1,(const char **)&saxpy_kernel, NULL, &clStatus);

  // Build the program
  clStatus = clBuildProgram(program, 1, device_list, NULL, NULL, NULL);

  // Create the OpenCL kernel
  cl_kernel kernel = clCreateKernel(program, "saxpy_kernel", &clStatus);

  // Set the arguments of the kernel
  clStatus = clSetKernelArg(kernel, 0, sizeof(float), (void *)&alpha);
  clStatus = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&A_clmem);
  clStatus = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&B_clmem);
  clStatus = clSetKernelArg(kernel, 3, sizeof(cl_mem), (void *)&C_clmem);

  // Execute the OpenCL kernel on the list
  size_t global_size = VECTOR_SIZE; // Process the entire lists
  size_t local_size = 64;           // Process one item at a time
  clStatus = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);

  // Read the cl memory C_clmem on device to the host variable C
  clStatus = clEnqueueReadBuffer(command_queue, C_clmem, CL_TRUE, 0, VECTOR_SIZE * sizeof(float), C, 0, NULL, NULL);

  // Clean up and wait for all the comands to complete.
  clStatus = clFlush(command_queue);
  clStatus = clFinish(command_queue);

  // Display the result to the screen
  for(i = 0; i < VECTOR_SIZE; i++)
    printf("%f * %f + %f = %f\n", alpha, A[i], B[i], C[i]);

  // Finally release all OpenCL allocated objects and host buffers.
  clStatus = clReleaseKernel(kernel);
  clStatus = clReleaseProgram(program);
  clStatus = clReleaseMemObject(A_clmem);
  clStatus = clReleaseMemObject(B_clmem);
  clStatus = clReleaseMemObject(C_clmem);
  clStatus = clReleaseCommandQueue(command_queue);
  clStatus = clReleaseContext(context);
  free(A);
  free(B);
  free(C);
  free(platforms);
  free(device_list);
  return 0;
}

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you have purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

The preceding code can be compiled on command prompt using the following command:

Linux:

gcc -I $(AMDAPPSDKROOT)/include -L $(AMDAPPSDKROOT)/lib -lOpenCL saxpy.cpp -o saxpy
./saxpy

Windows:

cl /c saxpy.cpp /I"%AMDAPPSDKROOT%\include"
link  /OUT:"saxpy.exe" "%AMDAPPSDKROOT%\lib\x86_64\OpenCL.lib" saxpy.obj
saxpy.exe

If everything is successful, then you will be able to see the result of SAXPY being printed in the terminal. For more ease in compiling the code for different OS platforms and different OpenCL vendors, we distribute the examples in this book with a CMAKE build script. Refer to the documentation of building the samples using the CMAKE build uitility.

By now you should be able to install an OpenCL implementation which your hardware supports. You can now compile and run any OpenCL sample code, on any OpenCL compliant device. You also learned the various parallel programming models and solved a data parallel problem of SAXPY computation.

Next you can try out some exercises on the existing code. Modify the existing program to take different matrix size inputs. Try to use a 2D matrix and perform a similar computation on the matrix.

OpenCL program flow

Every OpenCL code consists of the host-side code and the device code. The host code coordinates and queues the data transfer and kernel execution commands. The device code executes the kernel code in an array of threads called NDRange. An OpenCL C host code does the following steps:

Allocates memory for host buffers and initializes them.
Gets platform and device information. This is discussed in detail in Chapter 2, OpenCL Architecture.
Sets up the platform.
Gets the devices list and chooses the type of device you want to run on.
Creates an OpenCL context for the device.
Creates a command queue.
Creates memory buffers on the device for each vector.
Copies the Buffer A and B to the device.
Creates a program from the kernel source.
Builds the program and creates the OpenCL kernel.
Sets the arguments of the kernel.
Executes the OpenCL kernel on the device.
Reads back the memory from the device to the host buffer. This step is optional, you may want to keep the data resident in the device for further processing.
Cleans up and waits for all the commands to complete.
Finally releases all OpenCL allocated objects and host buffers.

We will discuss the details of each step in the subsequent chapters. Platform and device selection, along with context and command queue creation will be discussed in Chapter 2, OpenCL Architecture. OpenCL buffers are integral parts of any OpenCL program. The creation of these buffers and transferring (copying) buffer data between the host and the device is discussed in Chapter 3, Buffers and Image Objects – Image Processing. Creating an OpenCL kernel object from an OpenCL program object, and setting the kernel arguments is discussed in Chapter 5, OpenCL Program and Kernel Objects.

Run on a different device

To make OpenCL run the kernel on the CPU, you can change the enum CL_DEVICE_TYPE_GPU to CL_DEVICE_TYPE_CPU in the call to clGetDeviceIDs. This shows how easy it is to make an OpenCL program run on different compute devices. The first sample source code is self-explanatory and each of the steps are commented. If you are running a multi GPU hardware system, then you will have to modify the code to use the appropriate device ID.

The OpenCL specification is described in terms of the following four models:

Platform model: This model specifies the host and device specification. The host-side code coordinates the execution of the kernels in the devices.
Memory model: This model specifies the global, local, private, and constant memory. The OpenCL specification describes the hierarchy of memory architecture, regardless of the underlying hardware.
Execution model: This model describes the runtime snapshot of the host and device code. It defines the work-items and how the data maps onto the work-items.
Programming model: The OpenCL programming model supports data parallel and task parallel programming models. This also describes the task synchronization primitives.

We will discuss each model in detail in Chapter 2, OpenCL Architecture.

Finally to conclude this chapter, General Purpose GPU Computing (GPGPU or just GPU computing) is undeniably a hot topic in this decade. We've seen diminishing results in CPU speeds in the past decade compared to the decade before that. Each successive manufacturing node presents greater challenges than the preceding one. The shrink in process technology is nearing an end, and we cannot expect exponential improvements in serial program execution. Hence, adding more cores to the CPU is the way to go, and thereby parallel programming. A popular law called Gustafson's law suggests that computations involving large data sets can be efficiently parallelized.

Filter reviews by

All

Amazon verified reviews

Robin T. Wernick Mar 18, 2014

Due to the war between the GPUs reaching a terminal phase this may be the last comprehensive book published for OpenCL. Nvidia's dominance in the field of GPU programming and the prevalence of CUDA code examples are overshadowing OpenCL.However, there are many systems out there still using AMD's products and the mobile market is still OpenCL accessible. The ARM chariman says so and Sony's Experia can use it. Also, Nexus 10 and 4 devices have drivers for it. So, tablet designers attention! If you want to manage eye-popping animations or twisty-curley 3D simulations or detector displays, then pay attention to this manual.You will have to search the internet and fire up your Android NDK in some cases to maximize your mobile device to manage a display of 10,000 airplane tracks on a map of the US or merge six voice channels into one conference call.For the rest of us, it's a piece of cake, drivers for all the Nvidia and ATI cards are available. Every computing device with a supported GPU can run this code. I even ran the OpenCL starfield simulation on my MSI portable at 112 times the speed of the CPU and it was running on an Nvidia GTX 570GTM internal chip!This book will show you how to take all those simulation ideas and run any computer with a decent graphics card 100 times faster than any current CPU can manage. The book covers all the aspects of using OpenCL from buffer loading, kernel creating and executing, to synchronization and performance evaluation.OK, at 277 pages, its a bit thin for the price. But, what is the cost of fame and fortune? So, buck up and get your head wrapped around this information. The sky is the limit to your imagination here.

Amazon Verified review

lefaucheux Jul 16, 2014

Couvre les besoins pour bien comprendre OpenCL et même plus ...Facile à assimiler que l'on soit développeurs avancés ou débutants

milesrf Jun 17, 2016

Looks like it will be useful once I get started with a compatible C++ compiler.

Paul T. Miller Mar 16, 2014

OpenCL Programming by Example, published by Packt Publishing, is geared toward the OpenCL beginner, so if you already have experience with OpenCL the book doesn’t offer much, though I did learn some nice tips and the OpenCL 1.2 information is useful.The book starts off with some nicely detailed background information on GPU architectures – no surprise as the authors work for AMD. Information is provided on getting the necessary OpenCL libraries and runtimes installed and read for use, so there shouldn’t be any confusion about how to get started. Then comes the nicely detail sample code, which is abundant and well documented.The book covers both basic and more complex GPU algorithms, including basic image processing with OpenCL images, computing histograms with reductions, implementing JPEG compression, and even goes into advanced case studies of speeding up curve-fitting and bitonic sort. The explanations and examples are very complete.I had a few minor issues with the book though. The early chapters are nicely edited but later on there are many grammatical errors and editing mistakes. A lot of boilerplate code is repeated in the examples; it would have been nice if the authors had developed a little wrapper or framework to hide some of the more mundane stuff. And as someone who uses OpenCL for image processing, I would like to have seen more examples in that area. The image processing examples were fairly mundane and well covered elsewhere.Gripes aside, I would definitely recommend this book to a programmer who wants to learn OpenCL. This is probably the most complete book I’ve read yet on the subject. Finally, I’d encourage everyone to check out Packt’s steadily growing library of technical books. The breadth of topics there will surprise you.

Liad Weinberger Apr 15, 2014

First and foremost, a proper disclosure: I was originally signed by Packt Publishing to write this book. Having failed to meet their schedule, Packt re-signed the book with the current authors and asked me to review the book during writing. I have reviewed the first couple of chapters and returned my comments, none of which were integrated in the book, nor have I received any additional chapters for review. My Amazon customer review may therefore seem biased, but I do hope the advised reader will take note and deduce, on her own accord, whether or not that is the case.And now my Amazon customer review:In the preface of the book, the section dubbed "Errata" starts with the following sentence: "Although we have taken every care to ensure the accuracy of our content, mistakes do happen". This is a rather standard statement in many technical books, and should be taken into account whenever a reader finds a mistake in the content. I find it hard to treat the book seriously though, having gone through the process of reviewing the first two chapters prior to the books publication, and ending up reading almost the exact same text in the published material, as what I have received for review and commented on.An example of such mistake that appears in the published book appears at the middle of chapter 1. In the section dubbed "Implement the SAXPY routine in OpenCL", the authors provide a code listing for an ANSI-C implementation of saxpy:void saxpy(int n, float a, float *b, float *c){ for (int i = 0; i < n; ++i) y[i] = a*x[i] + y[i];}I'm not trying to nitpick here, but this code will not compile under any C compiler, unless x and y are globally defined elsewhere, in which case the function will probably not accomplish the intended purpose. Of course it is clear that this is an editorial error, where x and y should have been b and c (or vise verse), but the fact that I commented on this while reviewing the chapter, and the error prevailed to the published version, sheds the promise of "we have taken every care to ensure the accuracy of our content" in a bad light.I would have been OK with the book if this was the only type of errors I found in it, but unfortunately they're not. The book contains a lot of information that newcomers to OpenCL, and to parallel programming in general will find interesting. It is too bad however that this information is flawed, and in some cases gives the impression that the authors, although undoubtedly used and are using OpenCL, fail to fully grasp some of the concepts or delicacies presented by the standard. Add to that the rather poor English grammar conveyed in the book (Packt's English editors were especially sloppy here), and you got yourself a recipe for producing misguided beginner programmers.Some examples are required to emphasis this claim. I'll give some from the 1st chapter, because if this early such mistakes are made, the following is of lesser importance.The 2nd paragraph in chapter 1 draws from Wikipedia's definition of parallel computing. It concludes with the following phrasing: "... large problems can often be divided into smaller ones, which are then solved concurrently (in parallel)". It is important to distinguish between the terms "parallel" and "concurrent". Although one would strive that her different "smaller" tasks run concurrently, that isn't always feasible, even as these are parallel tasks. Parallel is a logical term, describing the state of two (or more) task, having the theoretical ability of being executed at the same time. Concurrency is a physical term describing the target hardware's ability to execute parallel tasks. Parallel computing therefore, by no means, suggests that these parallel tasks will be executed concurrently.I will skip a few paragraphs, namely the rather odd list of parallel programming models in which the authors makes several errors regarding the "pros" and "cons" he chooses to list for each, and his decisions to put these models in comparison when they inherently target different levels of parallelism (local, heterogeneous systems, and distributed systems), and when these models are usually used in conjunction to one another, and complement one another, rather than compete.In the section dubbed "Introduction to OpenCL" the authors claim "OpenCL gives developers a common set of easy-to-use tools...". This is false. OpenCL provides two things, and two things only: an API (divided into two layers) which allows a host processor to manage and coordinate computations across different target devices within the system, and a programming language named OpenCL C, which the developer uses in order to write the code that will be executed on these devices. No tools. More than that, the authors are clearly confused as in a later section dubbed "OpenCL program flow" he falsely writes " An OpenCL C host code does the following...". As mentioned above, OpenCL C is used for programming the devices, whereas the host utilized the OpenCL API.I can actually list more problems found in chapter 1, but I hope that at this point you got the point. This actually gets worse in the following chapters where the authors describe the 4 models presented by the OpenCL standard. Just a small example for a mistake in the text: the authors claim "For high performance every OpenCL device has an associated local memory. This is the memory closest to the OpenCL processing element". This is false, as OpenCL does not define how local memory is to be implemented, and therefore some vendors choose to implement it over the same physical banks of their global memory (which means it operates at the same latency as global memory - providing no performance gains). Local memory with regards to OpenCL, has no performance considerations. These are the domain of device-specific targeting. As you may by now suspect, the list of false claims is much longer, but I will not indulge in it at this time.On a positive note, the set of examples provided in the book (and some are actually pretty good) allows the authors to demonstrate the usage of key concepts in OpenCL programming, which leads me to believe the authors are indeed experienced users of OpenCL, although lack a deep understanding of the different aspects of the standard (they should be, as the lead author is an AMD employee, working on OpenCL code optimizations).The bottom line is simple: if you are serious about learning OpenCL programming - keep away from this book. Not because it doesn't give good examples, but because it is filled with misguided, and at times simply wrong information, both in background topics, but also, and perhaps more concerning, the actual matter of the OpenCL standard. Do yourself a favor and pick a copy of "OpenCL in Action" (published by: Manning Publications), or "Heterogeneous Computing with OpenCL" (published by: Morgan Kaufmann), or even "The OpenCL Programming Guide" (published by: Addison-Wesley Professional), which closely resembles the standard (written by some of the standard editors...), before you pick this book.

OpenCL Programming by Example: A comprehensive guide on OpenCL programming with examples with this book and ebook

What do you get with Print?

Contact Details

Shipping Address

Billing Address

CUDA or OpenCL?

Advanced Micro Devices, Inc. (AMD)

NVIDIA®

Intel®

ARM Mali™ GPUs

Windows

Linux

Installing OpenCL on a Linux system with an AMD graphics card

Installing OpenCL on a Linux system with an NVIDIA graphics card

Installing OpenCL on a Windows system with an AMD graphics card

Installing OpenCL on a Windows system with an NVIDIA graphics card

Apple OSX

Multiple installations

Implement the SAXPY routine in OpenCL

OpenCL code

Tip

OpenCL program flow

Run on a different device

Key benefits

Description

Who is this book for?

What you will learn

Product Details

What do you get with Print?

Contact Details

Shipping Address

Billing Address

Product Details

Packt Subscriptions

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

Filter reviews by

People who bought this also bought

About the authors

FAQs

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access

NVIDIA^®

Intel^®

ARM Mali^™ GPUs