Introduction to OpenCL

Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms, which may consist of CPUs, GPUs, DSPs, and FPGAs. OpenCL specifies a programming language (based on C99) for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task-based and data-based parallelism.

First thing to notice

While OpenCL can natively talk to a large range of devices, that doesn't mean that your code will run optimally on all of them without any effort on your part. In fact, there's no guarantee it will even run at all, given that different CL devices have very different feature sets. If you stick to the OpenCL spec and avoid vendor-specific extensions, your code should be portable, if not tuned for speed.

Actors in an OpenCL system

Use cases

These are the typical cases where you should use GPUs for computation.

  • Fast Permutation: Devices move memory faster than the Host

  • Data Translation: Change from one format to another

  • Numerical Acceleration: Devices calculate big chunks of data faster than the Host

Heterogeneous systems

A heterogeneous system is a system composed of multiple computing devices, for example a desktop system with a multicore CPU and a GPU.

Here are the main components of the system:

  • Host: Your desktop system

  • Compute Device: CPU, GPU, FPGA, DSP

  • Compute Unit: Number of cores

  • Processing Elements: ALUs on each core

You don't need to think too much about how the OpenCL device model fits onto specific hardware; that is the responsibility of the hardware vendor. Don't assume that a Processing Element is a "Processor" or CPU core.

OpenCL Models

To understand OpenCL we first need to understand the following models.

  • Device Model: How the device looks inside.

  • Execution Model: How work gets done on devices.

  • Memory Model: How devices and host see data.

  • Host API: How the host controls the devices.

OpenCL components

  1. C Host API: C API used to control the devices. (Ex: Memory transfer, kernel compilation)

  2. OpenCL C: Used on the device (Kernel Language)

Device Model

Some words about the memory regions:

  • Global Memory: Shared with the whole device, but slow. It is persistent between kernel calls.

  • Constant Memory: Faster than global memory; use it for read-only data such as filter parameters

  • Local Memory: Private to each compute unit, and shared by all of its processing elements.

  • Private Memory: Faster but local to each processing element.

Constant, local, and private memory are scratch space, so you cannot save data there to be used by other kernel invocations.

If you are coming from the CUDA world, this is how the OpenCL model maps onto the CUDA compute architecture.

Execution Model

OpenCL applications run on the Host, which submits work to the compute devices.

  1. Work Item: Basic unit of work on a compute device

  2. Kernel: The code that runs on a work item (Basically a C function)

  3. Program: Collection of kernels and other functions

  4. Context: The environment where work-items execute (Devices, their memories and command queues)

  5. Command Queue: Queue used by the host to submit work (kernels, memory copies) to the device.

The execution model defines how kernels execute at each point of a problem (an N-dimensional range). It can also be seen as the decomposition of a task into work-items.

What needs to be defined:

  • Global Work-size: Number of elements in your input vector.

  • Global offset: Offset added to each work-item's global id (usually zero).

  • Work-group size: Size of your compute partition.

Work-Groups

In an ideal case, you would have infinite processing elements (PEs), so each PE would handle one item of your data and they would never need to communicate. This is almost never the case, so you need to partition your work.

  • Partition global work into smaller pieces

  • Each partition is called a Work-group

  • Work-Groups are scheduled to execute on Compute Units

  • Each Work-Group has shared memory (the Local memory of its Compute Unit)

  • The Work-Items in a Work-Group are mapped directly to Processing Elements of the Compute Unit.

To summarize: Work-Groups are scheduled to Compute Units, and Work-Items execute on Processing Elements.

Consider the case below where you have a vector of 24 elements and your work-group size is 8. OpenCL will automatically partition your data into 3 work-groups of 8 work-items each.

Each work-group will be mapped to a compute unit, if you have enough compute units.

This process is done automatically by OpenCL; you just need to give the global work size (24) and the work-group size (8). In an ideal case, each PE gets one item of your work-group.

Now, what happens if you don't have enough compute units (also a normal case)? In that case OpenCL will hand the work-groups to the available compute units one by one, in a serial fashion. The beauty of this is that if you just plug in better hardware (more compute units), performance will scale.

Instead of dealing with 1D vectors, you can also work and think in 2D.

Another thing to point out is that work-items can only communicate when they are in the same work-group. Different work-groups cannot communicate. The only option would be to break the problem into 2 kernels and use global memory, but this would be too slow.

Launching the kernel

To launch a kernel you use the function clEnqueueNDRangeKernel, passing the parameters global_work_size and local_work_size. The global_work_size parameter defines the total number of work-items (threads) that will be launched, each one executing separately and getting one chunk of your input data. The local_work_size parameter defines the size, in work-items (threads), of your work-group. Work-items in a work-group can share local variables.

The number of work-groups is obtained by dividing global_work_size by local_work_size.

Each work-group is executed on a compute unit, which is able to handle many work-items, not only one. So it's not a good idea to have several small work-groups. On the other hand, if you have a work-group with too many work-items you will lose parallelism; the best ratio is found by trial and error.

Threads that run on the same compute unit can share variables, because they will be executing in the same work-group.

The work-group size can be chosen automatically (not always the best choice) if you pass NULL instead of local_work_size.
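
As a rough illustration, here is a minimal, hedged sketch of such a launch for the 24-element example above; the queue and kernel variables are assumed to have been created earlier.

// Minimal sketch of a 1-D launch, assuming `queue` and `kernel` were created earlier
size_t global_work_size = 24;  // total number of work-items
size_t local_work_size  = 8;   // work-items per work-group -> 24 / 8 = 3 work-groups

cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                    1,                  // work_dim: 1-D problem
                                    NULL,               // global offset (none)
                                    &global_work_size,
                                    &local_work_size,   // pass NULL here to let OpenCL choose
                                    0, NULL, NULL);     // no event wait list, no event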

Some notes:

  • A Kernel is invoked once for each work item. Each work item has private memory.

  • Work items are grouped into a work group. Each work group shares local memory

  • The total number of all work items is specified by the global work size.

  • Global and constant memory is shared across all work items of all work groups.

Host-API

OpenCL gives a set of functions to control the devices on your system. The device does not decide what to do; the Host API controls the whole system. Below are the major components of the Host API:

  • Platform

  • Context

  • Programs

  • Asynchronous Device calls

Platform

A platform is an OpenCL implementation. Think of it as a device driver that exposes the devices available on the heterogeneous system to you, for example a desktop computer with 2 GPUs, 1 FPGA card and one big 32-core CPU. The Platform API discovers the devices available to you.
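
A minimal sketch of querying the platform and its devices might look like the following (error handling omitted; the variable names are only for illustration):

cl_platform_id platform;
cl_device_id devices[8];
cl_uint num_devices;

// Take the first available OpenCL platform (implementation)
clGetPlatformIDs(1, &platform, NULL);

// List up to 8 devices (CPUs, GPUs, accelerators) exposed by this platform
clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);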

Context

The context allows you to group multiple devices of a specific platform. Imagine that you want to make available all the compute units of all devices. A context holds devices and memory.
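
For example, a context grouping the devices found above could be created like this (a hedged sketch, reusing the devices array from the previous snippet):

cl_int err;
// Group all discovered devices of this platform into one context
cl_context context = clCreateContext(NULL, num_devices, devices,
                                     NULL, NULL,   // no error callback
                                     &err);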

Programs

Just a collection of kernels that need to be compiled and/or loaded.
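
A hedged sketch of creating and building a program from source, then extracting one kernel from it; kernel_source is assumed to hold the OpenCL C source as a C string, and the kernel name vectorAdd matches the example later in this text.

// Create the program object from source and compile it for every device in the context
cl_program program = clCreateProgramWithSource(context, 1, &kernel_source, NULL, &err);
clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// Extract one kernel (entry point) from the compiled program
cl_kernel kernel = clCreateKernel(program, "vectorAdd", &err);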

Asynchronous Device calls

The Host API provides functions to issue commands to the devices (clEnqueue*). Those functions are asynchronous, so the host will not freeze while the device is executing a command.
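
For instance, a non-blocking write returns immediately and the host can keep working; clFinish (or an event) is used when you need to wait. A small sketch, assuming a command queue queue, a buffer object buf, a host array host_data and an element count n already exist:

// Enqueue an asynchronous copy of `host_data` to the device buffer `buf`
clEnqueueWriteBuffer(queue, buf, CL_FALSE,        // CL_FALSE = non-blocking
                     0, sizeof(float) * n, host_data,
                     0, NULL, NULL);

// ... the host can do other work here ...

// Block until every command in the queue has finished
clFinish(queue);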

Host program Structure

Most OpenCL host programs have the following structure.

In terms of source code, this big program always follows the same structure:

  1. Define (query) the platform and create a command queue

  2. Define memory objects (clCreateBuffer)

  3. Create the Program (library of kernels)

  4. Build the program

  5. Setup the kernel (clCreateKernel, clSetKernelArg)

  6. Get the results back to the Host
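
A hedged sketch of the buffer, kernel-argument and read-back steps, assuming the context, queue and kernel from the previous snippets; host_a, host_b, host_c and the element count n are hypothetical host-side variables.

// 2. Define memory objects: device buffers for the two inputs and the output
cl_mem d_a = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(float) * n, host_a, &err);
cl_mem d_b = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(float) * n, host_b, &err);
cl_mem d_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * n, NULL, &err);

// 5. Setup the kernel arguments (argument index, size, pointer to the value)
clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);

// ... launch the kernel with clEnqueueNDRangeKernel, as shown in "Launching the kernel" ...

// 6. Get the results back to the Host (CL_TRUE = blocking read)
clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0, sizeof(float) * n, host_c, 0, NULL, NULL);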

OpenCL Kernels

In OpenCL the devices execute kernels; those kernels are small functions written in OpenCL C, which is a subset of C (C99). A kernel is an entry point (like the main function) for device execution. Kernels are loaded and prepared by the Host.

Here are the main differences between C and OpenCL C:

  • No function pointers

  • No recursion

  • Has vector types

  • Has image types

  • Structures are allowed, but they can hurt performance and communicating them with the host can be complicated.

  • There is no mechanism for different kernels to cooperate

The kernel arguments are either pointers to global memory or plain values passed by the host.
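
For example, a kernel signature typically mixes __global pointers (buffers set with clSetKernelArg) and plain scalar values; the kernel below is hypothetical and only for illustration.

__kernel void scale(__global const float *in,   // pointer into global memory
                    __global       float *out,  // pointer into global memory
                    float factor,                // plain value passed by the host
                    uint  len)                   // plain value passed by the host
{
    size_t i = get_global_id(0);
    if (i < len)
        out[i] = in[i] * factor;
}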

Types

The only point to pay attention to is that integer types are represented in two's complement, and this may differ from the representation on the host.

Vector Types

Mixing scalar and vectors
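
A small, hedged illustration of vector types and of mixing scalars with vectors in OpenCL C:

// Vector literals: float2, float4, int8, etc. are built-in OpenCL C types
float4 a = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
float4 b = (float4)(0.5f);        // a single scalar is replicated into all components

// Mixing scalar and vector: the scalar is widened to the vector type
float4 c = a * b + 2.0f;          // component-wise multiply, then add 2.0f to each lane

// Component access: .x/.y/.z/.w, .lo/.hi or .s0..s3
float2 low = c.lo;
float  x   = c.x;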

Memory Regions

The code below shows how to declare pointers to the different memory regions (Local, Private, Constant, Global).

// Pointers to an integer in global memory
__global int *x;
__global int *y;
// This is OK: both pointers refer to the same address space
x = y;

// Now check this
__global int *a;
__private int *b;
// ERROR: you cannot assign a pointer across different memory regions
a = b;
// This is possible but costly (copies the value from private to global memory)
*a = *b;

Relevant functions

Remember that the work-group partitioning is done automatically by OpenCL; we just choose the work-group size.

  • get_global_id(n): Get the global id of the work-item in dimension (n).

  • get_global_offset(n): Get the global offset in dimension (n)

  • get_local_id(n): Which work-item am I inside the work-group, in dimension (n)
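
A small sketch (a hypothetical kernel) showing how these ids relate to each other; in one dimension, get_global_id(0) equals get_group_id(0) * get_local_size(0) + get_local_id(0) + get_global_offset(0).

__kernel void show_ids(__global int *out)
{
    size_t gid   = get_global_id(0);    // global index of this work-item
    size_t lid   = get_local_id(0);     // index inside the work-group
    size_t group = get_group_id(0);     // which work-group this item belongs to
    size_t lsize = get_local_size(0);   // work-group size

    // gid == group * lsize + lid + get_global_offset(0)
    out[gid] = (int)lid;
}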

1D Vector addition

For instance, imagine a function that needs to calculate the addition of two 1D vectors of 10,000,000 elements (10 million). In normal sequential programming this would be something like:

void vectorAdd( float *a, float *b, float *c, int numElements)
{
  int nIndex = 0;
  for (nIndex = 0; nIndex < numElements; nIndex++)
  {
    // This will execute 10,000,000 times, one iteration after the other.
    c[nIndex] = a[nIndex] + b[nIndex];
  }
}

With OpenCL kernels we have the option to execute this function on multiple processing elements. Ideally you could have one processing element for each vector element; in that case the whole operation would take a single step. In practice you frequently have fewer processing elements than elements to work with. Observe that we want to replace the for-loop with the parallel execution of its iterations.

 __kernel void
  vectorAdd(__global const float * a,
            __global const float * b,
            __global       float * c)
  {
      // Vector index
      int nIndex = get_global_id(0);
      c[nIndex] = a[nIndex] + b[nIndex];
  }

Using Local Memory

#define SCRATCH_SIZE 1024

__kernel void foo(__global float *in, __global float *out, uint len) {
  // All work-items in this work-group will be able to see "scratch"
  __local float scratch[SCRATCH_SIZE];

  // Get global id of work item on dimension 0
  size_t global_idx = get_global_id(0);
  // Get our local id inside the work-group
  size_t local_idx = get_local_id(0);

  // Avoid running past the input boundary
  if (global_idx >= len) {
    return;
  }

  // Do a copy from global memory to local memory
  scratch[local_idx] = in[global_idx];

  // Do something....

  out[global_idx] = scratch[local_idx];
}

Synchronization in OpenCL

As mentioned before, there is no way to synchronize work-items (threads) in different work-groups (different work-groups may execute on different compute units). But inside the same work-group OpenCL gives two options:

  • mem-fences

  • barriers

    Both of these take CLK_LOCAL_MEM_FENCE and/or CLK_GLOBAL_MEM_FENCE as a parameter. Using them has some impact on performance, but sometimes you really need them.

Mem-fences

mem_fence() waits until all reads/writes to local or global memory made by the calling work-item prior to the fence are visible to all work-items (threads) in the work-group. Basically, it enforces that changes made by the current work-item become visible to the whole work-group.

Barriers

barrier() waits until all work-items in the work-group have reached this point, and performs a mem_fence to ensure that all work-items (threads) see the same data.
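
As a hedged illustration, the sketch below (a hypothetical kernel) uses a barrier so that every work-item sees the local-memory writes of the whole work-group before reading them:

__kernel void reverse_in_group(__global const float *in,
                               __global       float *out,
                               __local        float *scratch)
{
    size_t gid  = get_global_id(0);
    size_t lid  = get_local_id(0);
    size_t size = get_local_size(0);

    scratch[lid] = in[gid];             // each work-item writes one element
    barrier(CLK_LOCAL_MEM_FENCE);       // wait until the whole work-group has written

    // Now it is safe to read what other work-items wrote: reverse the block
    out[gid] = scratch[size - 1 - lid];
}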

Some cross terminology between CUDA and OpenCL
