Using OpenCL

In this article by Raymond Tay, the author of OpenCL Parallel Programming Development Cookbook, we will cover the following recipes:

  • Querying OpenCL platforms
  • Querying OpenCL devices on your platform
  • Querying for OpenCL device extensions
  • Querying OpenCL contexts

Let's start the journey by looking back at the history of computing and at why OpenCL matters: it aims to unify the software programming model for heterogeneous devices. The goal of OpenCL is to provide a royalty-free standard for cross-platform, parallel programming of the modern processors found in personal computers, servers, and handheld/embedded devices. This effort is led by The Khronos Group with the participation of companies such as Intel, ARM, AMD, NVIDIA, Qualcomm, Apple, and many others. OpenCL allows software to be written once and then executed on any device that supports it; in this way it is akin to Java. This gives software development on these devices a uniform approach, and OpenCL achieves it by exposing the hardware through various data structures that interact with the hardware via Application Programming Interfaces (APIs). Today, OpenCL supports CPUs including x86, ARM, and PowerPC, and GPUs from AMD, Intel, and NVIDIA.

Developers can certainly appreciate the need for cross-platform software, since it allows them to work on whatever platform they are comfortable with, and it provides a coherent model in which to express a program that can run on any device supporting the standard. However, cross-platform compatibility also implies heterogeneous environments, and for quite some time developers have had to grapple with the issues that arise when writing software for such devices, from execution models to memory systems. Developers were also expected to express and extract parallelism from these devices. Before OpenCL, various programming languages and toolkits were invented to express parallelism (for example, Fortran, OpenMP, MPI, VHDL, Verilog, Cilk, Intel TBB, Unified Parallel C, and Java, among others), but these tools were designed for homogeneous environments. Even though mastering them may add considerable expertise to a developer's resume, taking a step back reveals that there is no unified approach to expressing parallelism in heterogeneous environments. We need not mention the time developers need to become productive with these technologies, since parallel decomposition is normally an involved process that is largely hardware dependent. To add salt to the wound, many developers only ever had to deal with homogeneous computing environments, but in the past few years the demand for heterogeneous computing environments has grown.

The demand for heterogeneous devices grew partially out of the need for high-performance, highly reactive systems, and with the "power wall" at play, one way to improve performance was to add specialized processing units in the hope of extracting every ounce of parallelism from them, since that is the only way to reach power efficiency. The primary motivation for this shift to hybrid computing can be traced to the research paper Optimizing Power Using Transformations by Anantha P. Chandrakasan, whose conclusion is essentially that many-core chips (which run at a slightly lower frequency than a contemporary CPU) are more power-efficient. The problem with heterogeneous computing without a unified development methodology such as OpenCL is that developers need to grasp several instruction set architectures (ISAs), along with the various levels of parallelism and the memory systems they make possible. CUDA, the GPGPU computing toolkit developed by NVIDIA, deserves a mention, not only because of its remarkable similarity to OpenCL, but also because the toolkit has seen wide adoption in academia as well as industry. Unfortunately, CUDA can only drive NVIDIA's GPUs.

The ability to extract parallelism from a heterogeneous environment is important simply because the computation should be parallel, otherwise it would defeat the entire purpose of OpenCL. Fortunately, the major processor companies are part of the consortium led by The Khronos Group and are actively realizing the standard. The story doesn't end there, though: we developers also need to understand parallelism and how it works in both homogeneous and heterogeneous environments. OpenCL was designed with the intention of expressing parallelism in a heterogeneous environment.

For a long time, developers largely ignored the fact that their software needed to take advantage of the multi-core machines available to them and continued to develop in a single-threaded manner, but that is changing (as discussed previously). In the many-core world, developers need to grapple with the concept of concurrency; the advantage of concurrency, when used effectively, is that it maximizes the utilization of resources by letting some work make progress while other work is stalled.

When software is executed concurrently on multiple processing elements so that threads run simultaneously, we have parallel computation. The challenge for the developer is to discover that concurrency and realize it. In OpenCL, we focus on two parallel programming models: task parallelism and data parallelism.

Task parallelism means that developers can create and manipulate concurrent tasks. When developing a solution for OpenCL, developers need to decompose a problem into different tasks, some of which can run concurrently, and it is these tasks that get mapped to the processing elements (PEs) of a parallel environment for execution. On the other side of the story, there are tasks that cannot run concurrently and may even be interdependent. An additional complexity is that data can be shared between tasks.

When attempting to realize data parallelism, developers need to readjust how they think about data and how it can be read and updated concurrently. A common problem in parallel computation is to compute the sum of all the elements in an arbitrary array of values while storing the intermediate sums; the operator being applied can be any binary associative operator. Conceptually, the developer could use a task to perform the addition of two elements of the input to derive a partial sum, repeating this pairwise until a single value remains.
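
The diagram is not reproduced here, but the idea can be captured in a minimal, sequential C sketch (an illustration of the pairwise scheme, not code from the book's bundle); each pass combines adjacent pairs with the binary associative operator, and in a parallel setting every combination within a pass could be handed to an independent task:

#include <stdio.h>

int main(void) {
    int data[8] = {3, 1, 7, 0, 4, 1, 6, 3};
    int n = 8;

    /* log2(n) passes; after each pass, half as many partial sums remain */
    for (int stride = 1; stride < n; stride *= 2) {
        for (int i = 0; i + stride < n; i += 2 * stride) {
            data[i] = data[i] + data[i + stride];   /* combine one pair */
        }
    }
    printf("sum = %d\n", data[0]);   /* prints 25 */
    return 0;
}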

Whether the developer chooses to embody task or data parallelism depends on the problem; an example where task parallelism makes sense is traversing a graph. Regardless of which model the developer is more inclined towards, each comes with its own set of problems once you start mapping the program to the hardware via OpenCL. Before the advent of OpenCL, the developer had to write a module that would execute on the desired device and handle communication and I/O with the driver program. An example of this is a graphics rendering program where the CPU initializes the data and sets everything up before offloading the rendering to the GPU. OpenCL was designed to take advantage of all detected devices so that resource utilization is maximized, and in this respect it differs from the "traditional" way of software development.

Now that we have established a good understanding of OpenCL, we should spend some time understanding how a developer can learn it. Not to fret: for every project you embark on, OpenCL will need you to do the following:

  • Discover the makeup of the heterogeneous system you are developing for
  • Understand the properties of those devices by probing them
  • Start the parallel program decomposition using task parallelism, data parallelism, or both, expressing them as instructions, also known as kernels, that will run on the platform
  • Set up data structures for the computation
  • Manipulate memory objects for the computation
  • Execute the kernels in the order that's desired on the proper device
  • Collate the results and verify for correctness

Next, we need to solidify the preceding points by taking a deeper look into the various components of OpenCL. The following components collectively make up the OpenCL architecture:

  • Platform Model: A platform is actually a host that is connected to one or more OpenCL devices. Each device comprises possibly multiple compute units (CUs), each of which can be decomposed into one or more processing elements, and it is on the processing elements that computation runs.
  • Execution Model: Execution of an OpenCL program is such that the host program executes on the host, and it is the host program that sends kernels to execute on one or more OpenCL devices on that platform.

    When a kernel is submitted for execution, an index space is defined such that a work item is instantiated to execute each point in that space. A work item would be identified by its global ID and it executes the same code as expressed in the kernel. Work items are grouped into work groups and each work group is given an ID commonly known as its work group ID, and it is the work group's work items that get executed concurrently on the PEs of a single CU.

    That index space we mentioned earlier is known as the NDRange, describing an N-dimensional space where N can range from one to three. Each work item has a global ID, and a local ID within its work group, each distinct from the others and derived from the NDRange. The same can be said of work group IDs. Let's use a simple example to illustrate how they work.

    Given two arrays, A and B, of 1024 elements each, we would like to perform a vector multiplication, the element-wise stage of a dot product, where each element of A is multiplied by the corresponding element in B. The kernel code would look something like the following:

    __kernel void vector_multiplication(__global int* a,
                                        __global int* b,
                                        __global int* c) {
        int i = get_global_id(0); // OpenCL built-in function
        c[i] = a[i] * b[i];
    }

    In this scenario, let's assume we have 1024 processing elements and we assign one work item to perform exactly one multiplication. In this case our work group ID would be zero (since there's only one group) and the work item IDs would range over {0 … 1023}. Recall what we discussed earlier: it is a work group's work items that execute on the PEs of a single CU. Reflecting on that, this would not be a good way of utilizing the device.

    In the same scenario, let's ditch the former assumption and go with this: we still have 1024 elements, but we group four work items into each work group, hence we have 256 work groups with IDs ranging over {0 … 255}. Notice that each work item's global ID still ranges over {0 … 1023}, simply because we have not increased the number of elements to be processed. Grouping work items into work groups in this manner is how these devices achieve scalability, since it increases execution efficiency by ensuring all PEs have something to work on.
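
    To make the relationship between these IDs concrete, here is a small kernel sketch (illustrative only, not part of the book's code bundle) that records each work item's IDs; with 1024 work items grouped four to a work group, each work item's global ID equals its group ID multiplied by the work group size plus its local ID:

    __kernel void show_ids(__global int* global_ids,
                           __global int* local_ids,
                           __global int* group_ids) {
        size_t gid = get_global_id(0);  // 0 .. 1023 in our example
        size_t lid = get_local_id(0);   // 0 .. 3 when the work group size is 4
        size_t grp = get_group_id(0);   // 0 .. 255 when there are 256 groups
        global_ids[gid] = (int) gid;    // gid == grp * get_local_size(0) + lid
        local_ids[gid]  = (int) lid;
        group_ids[gid]  = (int) grp;
    }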

    The NDRange can be conceptually mapped onto an N-dimensional grid, and the following diagram illustrates how a two-dimensional NDRange works, where WG-X denotes the length in rows of a particular work group and WG-Y denotes its length in columns, and how work items, including their respective IDs, are grouped into a work group.

    Before the kernels execute on the device(s), the host program plays an important role: it establishes a context with the underlying devices and lays down the order of execution of the tasks. The host program creates the context by establishing the existence (creating, if necessary) of the following:

    • All devices to be used by the host program
    • The OpenCL kernels, that is, functions and their abstractions that will run on those devices
    • The memory objects that encapsulate the data to be used or shared by the OpenCL kernels

    Once that is achieved, the host creates a data structure called a command queue, which it uses to coordinate the execution of the kernels on the devices; commands are issued to this queue and scheduled onto the devices. A command queue can accept kernel execution commands, memory transfer commands, and synchronization commands. Additionally, a command queue can execute commands in-order, that is, in the order they were given, or out-of-order. If the problem is decomposed into independent tasks, it is possible to create multiple command queues targeting different devices, schedule those tasks onto them, and have OpenCL run them concurrently.
  • Memory Model: So far, we have covered the execution model; it's time to introduce the memory model that OpenCL stipulates. Recall that when a kernel executes, it is actually each work item that executes its own instance of the kernel code. Hence a work item needs to read and write data from memory, and each work item has access to four types of memory: global, constant, local, and private. These memories vary in size as well as accessibility: global memory is the largest and the most accessible to work items, whereas private memory is the most restrictive in the sense that it is private to a single work item. Constant memory is a read-only region where immutable objects are stored and can be shared with all work items. Local memory is available only to the work items executing within the same work group and is held by each compute unit, that is, it is CU-specific.

    The application running on the host uses the OpenCL API to create memory objects in global memory and enqueues memory commands to the command queue to operate on them. It is the host's responsibility to ensure that data is available to the device when the kernel starts execution, and it does so by copying data or by mapping/unmapping regions of memory objects. During a typical data transfer from host memory to device memory, OpenCL commands are issued to queues and may be blocking or non-blocking. The primary difference is that a blocking transfer call returns only once it is safe to reuse the host buffer, whereas a non-blocking call returns as soon as the command is enqueued.

    Memory mapping in OpenCL makes a region of the memory space available for computation; the mapping command can be blocking or non-blocking, and the developer can treat the mapped region as readable, writable, or both.
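
    As a rough host-side sketch of the steps just described (variable names, sizes, and the lack of error checking are purely illustrative; this is not code from the book's bundle), the following shows a context and command queue being created for one device, a memory object being created in global memory, and data being moved with a blocking write and a blocking map:

    /* Assumes the OpenCL header (CL/cl.h or OpenCL/cl.h) has been
       included, as in the recipes that follow. */
    void hostSideSetupSketch(void) {
        cl_platform_id platform;
        cl_device_id device;
        cl_int err;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

        // Context and command queue for that single device
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

        // A memory object in global memory, plus some host data to transfer
        float host_data[16] = {0};
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    sizeof(host_data), NULL, &err);

        // Blocking write (CL_TRUE): returns only when host_data is safe to reuse
        clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, sizeof(host_data),
                             host_data, 0, NULL, NULL);

        // Blocking map (CL_TRUE): the returned pointer is readable and writable
        float* mapped = (float*) clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                                    CL_MAP_READ | CL_MAP_WRITE,
                                                    0, sizeof(host_data),
                                                    0, NULL, NULL, &err);
        mapped[0] = 42.0f;
        clEnqueueUnmapMemObject(queue, buf, mapped, 0, NULL, NULL);

        clReleaseMemObject(buf);
        clReleaseCommandQueue(queue);
        clReleaseContext(ctx);
    }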

    Henceforth, we are going to focus on the basics of OpenCL by getting our hands dirty developing small OpenCL programs, to understand a little more, programmatically, about how to use the platform and execution models of OpenCL.

The OpenCL specification version 1.2 is an open, royalty-free standard for general-purpose programming across various devices, ranging from mobile processors to conventional CPUs, and lately GPUs, through an API. At the time of writing, the standard supports:

  • Data-based and task-based parallel programming models
  • A subset of ISO C99 with extensions for parallelism, subject to some restrictions (for example, recursion, variadic functions, and variadic macros are not supported)
  • Mathematical operations that comply with the IEEE 754 specification
  • Porting to handheld and embedded devices by establishing configuration profiles
  • Interoperability with OpenGL, OpenGL ES, and other graphics APIs

Throughout this article, we are going to show you how you can become proficient in programming OpenCL.

As you go through the article, you'll discover not only how to use the API to perform all kinds of operations on your OpenCL devices, but you'll also learn how to model a problem and transform it from a serial program to a parallel program. More often than not, the techniques you'll learn can be transferred to other programming toolsets.

I have worked with several toolsets that allow the developer to express parallelism, including OpenCL, CUDA, OpenMP, MPI, Intel Threading Building Blocks, Cilk, and Cilk Plus, and I find that the entire process, from learning the tools to applying that knowledge, falls into four phases. These four phases are rather common, and I find it extremely helpful to keep them in mind as I go along. I hope you will benefit from them as well.

  • Finding concurrency: The programmer works in the problem domain to identify the available concurrency and expose it for use in the algorithm design
  • Algorithm structure: The programmer works with high-level structures for organizing a parallel algorithm
  • Supporting structures: This refers to how the parallel program will be organized and the techniques used to manage shared data
  • Implementation mechanisms: The final step is to look at the specific software constructs for implementing a parallel program

Don't worry about these concepts; they'll be explained as we move through the article.

The next few recipes we are going to examine have to do with understanding the usage of the OpenCL APIs, focusing our efforts on understanding the platform model of the architecture.

Querying OpenCL platforms

Before you start coding, ensure that you have installed the appropriate OpenCL development toolkit for the platform you are developing on. In this recipe, we are going to demonstrate how you can use OpenCL to query a platform and retrieve simple information about its various properties and the compliant devices it has detected.

Getting ready

In this first OpenCL application, you'll query your computer for the OpenCL platforms that are installed. Your computer could have a configuration in which both NVIDIA and AMD graphics cards are installed, in which case you might have installed both the AMD APP SDK and NVIDIA's OpenCL toolkit, and hence you would have both platforms installed.

The following code listing is extracted from Ch1/platform_details/platform_details.c.

How to do it…

Pay attention to the included comments, as they would help you to understand each individual function:

#include <stdio.h>
#include <stdlib.h>

#ifdef APPLE
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif

void displayPlatformInfo(cl_platform_id id,
                         cl_platform_info param_name,
                         const char* paramNameAsStr) {
    cl_int error = 0;
    size_t paramSize = 0;

    /* First call: ask how large the value for this parameter is */
    error = clGetPlatformInfo(id, param_name, 0, NULL, &paramSize);
    char* moreInfo = (char*) alloca(sizeof(char) * paramSize);
    /* Second call: fill the stack-allocated buffer with the value */
    error = clGetPlatformInfo(id, param_name, paramSize, moreInfo, NULL);
    if (error != CL_SUCCESS) {
        perror("Unable to find any OpenCL platform information");
        return;
    }
    printf("%s: %s\n", paramNameAsStr, moreInfo);
}

int main() {
    /* OpenCL 1.2 data structures */
    cl_platform_id* platforms;
    /* OpenCL 1.1 scalar data types */
    cl_uint numOfPlatforms;
    cl_int error;

    /*
       Get the number of platforms.
       Remember that for each vendor's SDK installed on the
       computer, the number of available platforms increases.
    */
    error = clGetPlatformIDs(0, NULL, &numOfPlatforms);
    if (error < 0) {
        perror("Unable to find any OpenCL platforms");
        exit(1);
    }

    // Allocate memory for the number of installed platforms.
    // alloca(...) occupies some stack space but is
    // automatically freed on return
    platforms = (cl_platform_id*) alloca(sizeof(cl_platform_id)
                                         * numOfPlatforms);
    printf("Number of OpenCL platforms found: %d\n", numOfPlatforms);

    // Fill the array with the IDs of the platforms we just counted
    error = clGetPlatformIDs(numOfPlatforms, platforms, NULL);
    if (error < 0) {
        perror("Unable to obtain the OpenCL platform IDs");
        exit(1);
    }

    // We invoke the API 'clGetPlatformInfo' twice for each
    // parameter we're trying to extract: the first call returns the
    // size of the value, which we use to create a temporary buffer
    // (on the stack) to store the information returned by the
    // second invocation.
    for (cl_uint i = 0; i < numOfPlatforms; ++i) {
        displayPlatformInfo(platforms[i], CL_PLATFORM_PROFILE,
                            "CL_PLATFORM_PROFILE");
        displayPlatformInfo(platforms[i], CL_PLATFORM_VERSION,
                            "CL_PLATFORM_VERSION");
        displayPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                            "CL_PLATFORM_NAME");
        displayPlatformInfo(platforms[i], CL_PLATFORM_VENDOR,
                            "CL_PLATFORM_VENDOR");
        displayPlatformInfo(platforms[i], CL_PLATFORM_EXTENSIONS,
                            "CL_PLATFORM_EXTENSIONS");
    }
    return 0;
}

To compile it on a UNIX platform, you would run a compile command similar to the following:

gcc -std=c99 -Wall -DUNIX -g -DDEBUG -DAPPLE -arch i386 -o platform_details platform_details.c -framework OpenCL

If the compilation succeeds, you will have a binary executable named platform_details.

To run the program, simply execute the platform_details binary; a sample output on OS X follows:

Number of OpenCL platforms found: 1
CL_PLATFORM_PROFILE: FULL_PROFILE
CL_PLATFORM_VERSION: OpenCL 1.0 (Dec 23 2010 17:30:26)
CL_PLATFORM_NAME: Apple
CL_PLATFORM_VENDOR: Apple
CL_PLATFORM_EXTENSIONS:

How it works…

When you first learn to program OpenCL it can be daunting, but it does get better as we move along. So, let's decipher the source code we've just seen. The file is a C source file, and you'll notice that it's arranged such that the system header files are placed right near the top:

#include <stdlib.h>
#include <stdio.h>

Next is what C programmers would call platform-dependent code:

#ifdef APPLE
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif

The OpenCL header files are needed for the program to compile because they contain the function signatures. Now, let's try to understand what the rest of the code is doing. In OpenCL, one of the code conventions is that data types are prefixed by cl_; you'll find data types for the platform, device, and context named cl_platform_XX, cl_device_XX, and cl_context_XX, and APIs prefixed in a similar fashion by cl, one such API being clGetPlatformInfo.

In OpenCL, the APIs do not assume that you know exactly how many resources (for example, platforms, devices, and contexts) are present or needed when you write the OpenCL code. In order to let you write portable code, the designers of the API present it in a clever way: you use the same API call to pose a general question and then, based on the results of that question, request more information via the same call. Let me illustrate with an example.

In the code, you will notice that clGetPlatformInfo() was invoked twice. The first invocation queries the size of the information to be returned; based on the result of that query, we invoke clGetPlatformInfo again, this time passing in a buffer to receive context-sensitive information, for example, the name of the vendor. You'll find this pattern recurring when programming with OpenCL. The only con I can think of is that it makes the API rather cryptic at times, but the nice thing is that it prevents a proliferation of APIs in the language.

Admittedly, this is rather trivial when it comes to the entire ecosystem of programming OpenCL.

Querying OpenCL devices on your platform

We'll now query the OpenCL devices that are installed on your platform.

Getting ready

The code listing discussed in the How to do it… section presents an abbreviated portion of the code in Ch1/device_details/device_details.c. It demonstrates how you can obtain the types of devices installed on your platform via clGetDeviceIDs, and then use that information to retrieve detailed data about each device by passing it to clGetDeviceInfo.

How to do it…

Pay attention to the included comments, as they would help you understand each individual function. We've included the main part of this recipe with highlighted commentary:

/* C function prototype */
void displayDeviceDetails(cl_device_id id, cl_device_info param_name,
                          const char* paramNameAsStr);


void displayDeviceInfo(cl_platform_id id,
                       cl_device_type dev_type) {
    /* OpenCL 1.1 scalar data types */
    cl_int error = 0;
    cl_uint numOfDevices = 0;

    /* Determine how many devices are connected to your platform */
    error = clGetDeviceIDs(id, dev_type, 0, NULL, &numOfDevices);
    if (error != CL_SUCCESS) {
        perror("Unable to obtain any OpenCL compliant device info");
        exit(1);
    }
    cl_device_id* devices = (cl_device_id*)
        alloca(sizeof(cl_device_id) * numOfDevices);

    /* Load the information about your devices into the variable
       'devices' */
    error = clGetDeviceIDs(id, dev_type, numOfDevices, devices, NULL);
    if (error != CL_SUCCESS) {
        perror("Unable to obtain any OpenCL compliant device info");
        exit(1);
    }
    printf("Number of detected OpenCL devices: %d\n", numOfDevices);

    /* We attempt to retrieve some information about the devices. */
    for (cl_uint i = 0; i < numOfDevices; ++i) {
        displayDeviceDetails(devices[i], CL_DEVICE_TYPE,
                             "CL_DEVICE_TYPE");
        displayDeviceDetails(devices[i], CL_DEVICE_VENDOR_ID,
                             "CL_DEVICE_VENDOR_ID");
        displayDeviceDetails(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                             "CL_DEVICE_MAX_COMPUTE_UNITS");
        displayDeviceDetails(devices[i], CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS,
                             "CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS");
        displayDeviceDetails(devices[i], CL_DEVICE_MAX_WORK_ITEM_SIZES,
                             "CL_DEVICE_MAX_WORK_ITEM_SIZES");
        displayDeviceDetails(devices[i], CL_DEVICE_MAX_WORK_GROUP_SIZE,
                             "CL_DEVICE_MAX_WORK_GROUP_SIZE");
    }
}

void displayDeviceDetails(cl_device_id id,
                          cl_device_info param_name,
                          const char* paramNameAsStr) {
    cl_int error = 0;
    size_t paramSize = 0;

    error = clGetDeviceInfo(id, param_name, 0, NULL, &paramSize);
    if (error != CL_SUCCESS) {
        perror("Unable to obtain device info for param\n");
        return;
    }

    /* The cl_device_info values are preprocessor directives defined
       in cl.h */
    switch (param_name) {
    case CL_DEVICE_TYPE: {
        cl_device_type* devType = (cl_device_type*)
            alloca(sizeof(cl_device_type) * paramSize);
        error = clGetDeviceInfo(id, param_name, paramSize, devType, NULL);
        if (error != CL_SUCCESS) {
            perror("Unable to obtain device info for param\n");
            return;
        }
        switch (*devType) {
        case CL_DEVICE_TYPE_CPU:     printf("CPU detected\n"); break;
        case CL_DEVICE_TYPE_GPU:     printf("GPU detected\n"); break;
        case CL_DEVICE_TYPE_DEFAULT: printf("default detected\n"); break;
        }
    } break;
    // omitted code – refer to source "device_details.c"
    } // end of switch
}

On UNIX platforms, you can compile device_details.c by running this command in your terminal:

gcc -std=c99 -Wall -DUNIX -g -DDEBUG -DAPPLE -arch i386 -o device_details device_details.c -framework OpenCL

And a binary executable named device_details should be deposited locally on your machine.

When you execute the binary, you will see varying results depending on your machine's setup. On my OS X platform, here is the output when executed on a machine with an Intel Core i5 processor and an NVIDIA GeForce GT 330M mobile GPU:

Number of OpenCL platforms found: 1
CL_PLATFORM_PROFILE: FULL_PROFILE
CL_PLATFORM_VERSION: OpenCL 1.0 (Dec 23 2010 17:30:26)
CL_PLATFORM_NAME: Apple
CL_PLATFORM_VENDOR: Apple
CL_PLATFORM_EXTENSIONS:
Number of detected OpenCL devices: 2
GPU detected
VENDOR ID: 0x1022600
Maximum number of parallel compute units: 6
Maximum dimensions for global/local work-item IDs: 3
Maximum number of work-items in each dimension: 512
Maximum number of work-items in a work-group: 512
CPU detected
VENDOR ID: 0x1020400
Maximum number of parallel compute units: 4
Maximum dimensions for global/local work-item IDs: 3
Maximum number of work-items in each dimension: 1
Maximum number of work-items in a work-group: 1

How it works…

Leveraging the work we did in the previous recipe, we now use the platform that was detected to query for the devices attached to it. This time, we used two new API functions, clGetDeviceIDs and clGetDeviceInfo. The former uncovers the devices attached to the given platform, and we use clGetDeviceInfo to iterate through them to understand more about their capabilities. This information is valuable when you are crafting your algorithm and are not sure what device it will run on. Considering that OpenCL supports various processors, it's a good way to write portable code.

There is actually a lot more information you can derive from your device, and I'd strongly suggest you refer to http://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/ and look at the man page for clGetDeviceInfo.
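
For instance, a couple of further queries could look like the following sketch (the helper function name is made up for illustration; CL_DEVICE_NAME and CL_DEVICE_GLOBAL_MEM_SIZE are standard parameter values defined in cl.h):

void displayExtraDeviceInfo(cl_device_id id) {
    char name[256];          /* ample for typical device names; a production
                                version would query the size first, as in
                                the recipe above */
    cl_ulong globalMemSize = 0;

    /* The device name comes back as a character string */
    clGetDeviceInfo(id, CL_DEVICE_NAME, sizeof(name), name, NULL);
    /* The global memory size comes back as a cl_ulong, in bytes */
    clGetDeviceInfo(id, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(globalMemSize), &globalMemSize, NULL);

    printf("Device name: %s\n", name);
    printf("Global memory size: %llu bytes\n",
           (unsigned long long) globalMemSize);
}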

Querying for OpenCL device extensions

Extensions in OpenCL allow the programmer to leverage additional capabilities provided by the vendor of a device, and hence they're optional. However, there are extensions that are recognized by OpenCL and, purportedly, supported by the major vendors.

Here's a partial list of the approved and supported extensions in OpenCL 1.2. If you wish to discover the entire list of extensions that adopters of OpenCL have made public (some are given in the table), please refer to the PDF document via this link: http://www.khronos.org/registry/cl/specs/opencl-1.2-extensions.pdf.

  • cl_khr_fp64: Double-precision floating point
  • cl_khr_int64_base_atomics: 64-bit integer base atomic operations; provides atomic operations for addition, subtraction, exchange, increment/decrement, and CAS
  • cl_khr_int64_extended_atomics: 64-bit integer extended atomic operations; provides atomic operations for finding the minimum and maximum, and Boolean operations such as and, or, and xor
  • cl_khr_3d_image_writes: Writes to 3D image objects
  • cl_khr_fp16: Half-precision floating point
  • cl_khr_global_int32_base_atomics: Atomics for 32-bit operands
  • cl_khr_global_int32_extended_atomics: More atomic functionality for 32-bit operands
  • cl_khr_local_int32_base_atomics: Atomics for 32-bit operands in the shared (local) memory space
  • cl_khr_local_int32_extended_atomics: More atomic functionality for 32-bit operands in the shared (local) memory space
  • cl_khr_byte_addressable_store: Allows memory writes to types smaller than a 32-bit word
  • cl_APPLE_gl_sharing: Mac OS X OpenGL sharing; allows applications to use OpenGL buffer, texture, and renderbuffer objects as OpenCL memory objects
  • cl_khr_gl_sharing: OpenGL sharing
  • cl_khr_gl_event: CL event objects from GL sync objects
  • cl_khr_d3d10_sharing: Sharing memory objects with Direct3D 10

Next, let's find out how we can determine what extensions are supported and available on your platform by leveraging the previous code we've worked on.

Getting ready

The listing below shows only the interesting portion of the code found in Ch1/device_extensions/device_extensions.c. Different OpenCL-compliant devices have different capabilities, and during application development you definitely want to make sure certain extensions are present before making use of them. The code discussed in the How to do it… section of this recipe shows you how to retrieve those extensions.

How to do it…

We've included the main querying function, which allows you to implement this particular recipe:


void displayDeviceDetails(cl_device_id id,
                          cl_device_info param_name,
                          const char* paramNameAsStr) {

    cl_int error = 0;
    size_t paramSize = 0;

    error = clGetDeviceInfo(id, param_name, 0, NULL, &paramSize);
    if (error != CL_SUCCESS) {
        perror("Unable to obtain device info for param\n");
        return;
    }
    /* The cl_device_info values are preprocessor directives defined
       in cl.h */
    switch (param_name) {
    // code omitted – refer to "device_extensions.c"
    case CL_DEVICE_EXTENSIONS: {
        // 'paramSize' is the length (in bytes) of the space-separated
        // extensions string, so a buffer of that size on the stack suffices
        char* extension_info = (char*) alloca(sizeof(char) * paramSize);
        error = clGetDeviceInfo(id, CL_DEVICE_EXTENSIONS,
                                paramSize, extension_info, NULL);
        if (error != CL_SUCCESS) {
            perror("Unable to obtain device info for param\n");
            return;
        }
        printf("\tSupported extensions: %s\n", extension_info);
    } break;
    } // end of switch
}

To compile the code, do as you did before by running a similar command in your terminal:

gcc -std=c99 -Wall -DUNIX -g -DDEBUG -DAPPLE -arch i386 -o device_extensions device_extensions.c -framework OpenCL

On a UNIX platform, here's what we got when executed on an Intel Core i5 processor with an NVIDIA GeForce GT 330M mobile GPU (note the supported extensions listed for each device):

Number of OpenCL platforms found: 1
CL_PLATFORM_PROFILE: FULL_PROFILE
CL_PLATFORM_VERSION: OpenCL 1.0 (Dec 23 2010 17:30:26)
CL_PLATFORM_NAME: Apple
CL_PLATFORM_VENDOR: Apple
CL_PLATFORM_EXTENSIONS:
Number of detected OpenCL devices: 2
GPU detected
VENDOR ID: 0x1022600
Maximum number of parallel compute units: 6
Maximum dimensions for global/local work-item IDs: 3
Maximum number of work-items in each dimension: ( 512 512 64 )
Maximum number of work-items in a work-group: 512
Supported extensions: cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_APPLE_gl_sharing cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics
CPU detected
VENDOR ID: 0x1020400
Maximum number of parallel compute units: 4
Maximum dimensions for global/local work-item IDs: 3
Maximum number of work-items in each dimension: ( 1 1 1 )
Maximum number of work-items in a work-group: 1
Supported extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_APPLE_gl_sharing cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions

How it works…

Examining the work we just did, we simply leveraged the existing code and added the needed functionality where required, namely by adding code to handle the case where CL_DEVICE_EXTENSIONS is passed in. We created an array of the appropriate size on the stack and passed it to clGetDeviceInfo, which stores the information into that array. Extracting the information is as simple as printing out the array. For advanced usage, you might want to deposit that information into a global table where other parts of the application can make use of it.

To understand what those extensions mean and how you can take advantage of them, I'd suggest that you refer to the Khronos registry for OpenCL: http://www.khronos.org/registry/cl/.
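
As a practical follow-up, checking for one specific extension before using it can be as simple as a substring search over the space-separated extensions string; the sketch below (the helper name is made up for illustration) reuses the two-step clGetDeviceInfo pattern from the recipe:

#include <string.h>

int deviceSupportsExtension(cl_device_id id, const char* extension) {
    size_t paramSize = 0;
    clGetDeviceInfo(id, CL_DEVICE_EXTENSIONS, 0, NULL, &paramSize);
    char* extensions = (char*) alloca(sizeof(char) * paramSize);
    clGetDeviceInfo(id, CL_DEVICE_EXTENSIONS, paramSize, extensions, NULL);
    return strstr(extensions, extension) != NULL;
}

/* Usage:
   if (deviceSupportsExtension(devices[i], "cl_khr_fp64")) {
       // double-precision kernels are safe to use on this device
   }
*/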

We won't dwell too much on each extension that we've seen so far. Let's move on to understanding the OpenCL contexts.

Querying OpenCL contexts

An OpenCL context is created with one or more devices. Contexts are used by the OpenCL runtime for managing objects such as command queues (the object that allows you to send commands to the device), memory, program, and kernel objects, and for executing kernels on one or more devices specified in the context.

In more detail, OpenCL contexts can be created either by passing a collection of devices available on a platform to clCreateContext, or by associating the context with a particular type of device, for example, CPUs or GPUs, via clCreateContextFromType. Either way, you cannot create a context that is associated with more than one platform. Let's use the vector multiplication example from the Introduction to demonstrate these concepts. The problem of vector multiplication, or dot product, can be solved using pen and paper, the CPU, the GPU, or GPU + CPU. Obviously, the first option doesn't scale once we have more than about 20 elements, and with OpenCL you have more options. The first thing you need to decide is which platform it should run on, which in OpenCL means deciding whether to use AMD, NVIDIA, Intel, and so on. What comes next is deciding whether to run the dot product on all of the devices listed for that platform or only some of them.

So, let's assume that the platform reports one Intel Core i7 and three AMD GPUs. The developer could use clCreateContextFromType to restrict execution to either CPUs or GPUs, but with clCreateContext you can, theoretically speaking, list all four devices to be executed against (in practice, it's hard to use CPUs and GPUs together effectively because the GPU can push far more threads for execution than the CPU). The following diagram illustrates the options available to the developer for creating contexts, assuming the host environment has both Intel's and AMD's OpenCL platform software installed. The configuration gets a little more interesting when you consider the Ivy Bridge Intel processors, which include an HD Graphics co-processor and so allow a context that is both CPU and GPU aware.
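
As a small illustration (not from the book's code bundle), creating a GPU-only context with clCreateContextFromType might look like the following sketch, where the platform variable is assumed to hold an ID obtained earlier via clGetPlatformIDs:

cl_context_properties props[] = {
    CL_CONTEXT_PLATFORM, (cl_context_properties) platform,
    0                                   /* terminate the property list */
};
cl_int err = CL_SUCCESS;
cl_context gpuContext = clCreateContextFromType(props, CL_DEVICE_TYPE_GPU,
                                                NULL, NULL, &err);
if (err != CL_SUCCESS) {
    perror("Unable to create a GPU-only context");
}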

Contexts have another interesting property: they retain a reference count, so that third-party libraries can refer to a context and hence utilize the devices. For example, if the cl_khr_d3d10_sharing extension is available on your device, you can interoperate between OpenCL and Direct3D 10 and treat Direct3D 10 resources as OpenCL memory objects that you can read from or write to. However, we will not demonstrate that extension in this article and will instead leave it to the reader to explore further.

Getting ready

The code listing given in the How to do it… section is extracted from Ch1/context_query/context_details.c, and it illustrates how to create and release OpenCL contexts.

How to do it…

To query an OpenCL context, you need to include a function similar to the following in your code. You should reference the full code listing alongside this recipe:

void createAndReleaseContext(cl_platform_id id,
                             cl_device_type dev_type) {
    /* OpenCL 1.1 scalar data types */
    cl_int error = 0;
    cl_uint numOfDevices = 0;

    /* Determine how many devices are connected to your platform */
    error = clGetDeviceIDs(id, dev_type, 0, NULL, &numOfDevices);
    if (error != CL_SUCCESS) {
        perror("Unable to obtain any OpenCL compliant device info");
        exit(1);
    }
    cl_device_id* devices = (cl_device_id*)
        alloca(sizeof(cl_device_id) * numOfDevices);

    /* Load the information about your devices into the variable
       'devices' */
    error = clGetDeviceIDs(id, dev_type, numOfDevices, devices, NULL);
    if (error != CL_SUCCESS) {
        perror("Unable to obtain any OpenCL compliant device info");
        exit(1);
    }

    printf("Number of detected OpenCL devices: %d\n", numOfDevices);

    /*
       We attempt to create a context for each device we find,
       report on it, and release the context. When a context is
       created, it is implicitly retained (its reference count is
       set to one), so you don't have to invoke 'clRetainContext'
       yourself.
    */
    for (cl_uint i = 0; i < numOfDevices; ++i) {
        cl_context context = clCreateContext(NULL, 1, &devices[i],
                                             NULL, NULL, &error);
        cl_uint ref_cnt = 0;
        if (error != CL_SUCCESS) {
            perror("Can't create a context");
            exit(1);
        }

        error = clGetContextInfo(context, CL_CONTEXT_REFERENCE_COUNT,
                                 sizeof(ref_cnt), &ref_cnt, NULL);
        if (error != CL_SUCCESS) {
            perror("Can't obtain context information");
            exit(1);
        }
        printf("Reference count of device is %d\n", ref_cnt);

        // Release the context
        clReleaseContext(context);
    }
}

On UNIX platforms, you can compile and build the program by typing the following command:

gcc -std=c99 -Wall -DUNIX -g -DDEBUG -DAPPLE -arch i386 -o context_details context_details.c -framework OpenCL

On the test machine, we have two OpenCL-compliant devices: the first is the Intel Core i5 CPU and the second is the NVIDIA GeForce GT 330M mobile GPU. The following is the output:

Number of OpenCL platforms found: 1
Number of detected OpenCL devices: 2
Reference count of device is 1
Reference count of device is 1

How it works…

If you have been following the article, you will realize that we didn't do anything special other than build on the previous exercises: we discover the platforms installed, uncover the devices, and use that information to create the relevant contexts, which we can then query. What you will notice is that the context's reference count is one in both cases, which indicates that exactly one object currently holds a reference to the context; passing CL_CONTEXT_REFERENCE_COUNT to clGetContextInfo is what retrieves this counter. The counter is chiefly useful when you want to detect whether the application is leaking contexts, which really means leaking memory. For OpenCL devices such as desktop CPUs or GPUs, that might not sound like a big deal, but for mobile processors it poses quite a serious problem, since memory leaks waste resources and ultimately deplete battery life.

There are actually more details you can query about the context via clGetContextInfo by passing in the various cl_context_info types. Here's a list of them:

  • CL_CONTEXT_REFERENCE_COUNT (returns cl_uint): The context reference count
  • CL_CONTEXT_NUM_DEVICES (returns cl_uint): The number of devices in the context
  • CL_CONTEXT_DEVICES (returns cl_device_id[]): The list of devices in the context
  • CL_CONTEXT_PROPERTIES (returns cl_context_properties[]): The properties argument specified in clCreateContext or clCreateContextFromType
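
Continuing from the recipe's context variable, a small sketch (illustrative only) of querying two of these properties could look like this:

cl_uint numDevs = 0;
clGetContextInfo(context, CL_CONTEXT_NUM_DEVICES,
                 sizeof(numDevs), &numDevs, NULL);

cl_device_id* ctxDevices = (cl_device_id*)
    alloca(sizeof(cl_device_id) * numDevs);
clGetContextInfo(context, CL_CONTEXT_DEVICES,
                 sizeof(cl_device_id) * numDevs, ctxDevices, NULL);

printf("Number of devices in context: %u\n", numDevs);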

Summary

We've understood the basics of querying the platform, devices, extensions, and contexts in this article.

About the Author


Raymond Tay

Raymond Tay has been a software developer for the past decade, and his favorite programming languages include Scala, Haskell, C, and C++. He started playing with GPGPU technology in 2008, first with the CUDA toolkit from NVIDIA and the OpenCL toolkits from AMD and, later, Intel. In 2009, he submitted a GPGPU project he was working on to the editorial committee of "GPU Computing Gems", published by Morgan Kaufmann. Though his work didn't make it into the final published book, he was very happy to have been short-listed for candidacy. Since then, he has worked on projects that use GPGPU technology and techniques in CUDA and OpenCL. He is also passionate about functional programming paradigms and their applications in cloud computing, which has led him to investigate various ways to accelerate applications in the cloud through GPGPU technology and the functional programming paradigm. He is a strong believer in continuous learning and hopes to be able to continue to do so for as long as he possibly can.
