Introduction to OpenCL
OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms, consisting for example of CPUs, GPUs, DSPs, and FPGAs. OpenCL specifies a programming language (based on C99) for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task-based and data-based parallelism.
While OpenCL can natively talk to a large range of devices, that doesn't mean that your code will run optimally on all of them without any effort on your part. In fact, there's no guarantee it will even run at all, given that different CL devices have very different feature sets. If you stick to the OpenCL spec and avoid vendor-specific extensions, your code should be portable, if not tuned for speed.
These are the typical cases where you should use GPUs for computation:
Fast Permutation: the device moves memory faster than the host
Data Translation: converting data from one format to another
Numerical Acceleration: the device computes large chunks of data faster than the host
A heterogeneous system is composed of multiple computing devices, for example a desktop system with a multicore CPU and a GPU.
Here are the main components of the system:
Host: your desktop system
Compute Device: CPU, GPU, FPGA, DSP
Compute Unit: the cores of a compute device
Processing Elements: the ALUs on each core
You don't need to think too much about how the OpenCL device model fits onto specific hardware; that is the responsibility of the hardware vendor. In particular, don't assume that a Processing Element is a "processor" or a CPU core.
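As a concrete illustration, a minimal host-side sketch (error handling omitted) can ask the driver how many compute units the first device of the first platform exposes; the number you get back is whatever the vendor's implementation reports:

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_uint compute_units;

    /* Take the first platform and its first device of any type */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

    /* Ask how many compute units this device exposes */
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);

    printf("Compute units: %u\n", compute_units);
    return 0;
}
```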
To understand OpenCL we first need to understand the following models:
Device Model: how a device looks inside.
Execution Model: how work gets done on devices.
Memory Model: how devices and the host see data.
Host API: how the host controls the devices.
C Host API: the C API used to control the devices (e.g. memory transfers, kernel compilation).
OpenCL C: the kernel language used on the device.
A few words about the memory regions:
Global Memory: shared with the whole device, but slow; it is persistent between kernel calls.
Constant Memory: faster than global memory; use it for things like filter parameters.
Local Memory: private to each compute unit and shared by all of its processing elements.
Private Memory: fastest, but local to each processing element.
Constant, local, and private memory are scratch space, so you cannot save data there to be used by other kernel invocations.
If you are coming from the CUDA world, this is how the OpenCL model maps onto the CUDA compute architecture.
OpenCL applications run on the host, which submits work to the compute devices.
Work Item: Basic unit of work on a compute device
Kernel: The code that runs on a work item (Basically a C function)
Program: Collection of kernels and other functions
Context: The environment where work-items execute (Devices, their memories and command queues)
Command Queue: Queue used by the host to submit work (kernels, memory copies) to the device.
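For example, a kernel really is just a C-style function marked with the __kernel qualifier. This small illustrative kernel (not from the original text) scales every element of a buffer:

```c
/* Illustrative kernel: each work-item scales one element of the buffer */
__kernel void scale(__global float* data, float factor) {
    int i = get_global_id(0);   /* which work-item am I globally? */
    data[i] = data[i] * factor;
}
```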
The execution model defines how kernels execute at each point of a problem (an N-dimensional range). It can also be seen as the decomposition of a task into work-items.
What needs to be defined:
Global work-size: the number of elements in your input vector.
Global offset
Work-group size: the size of your compute partition.
In an ideal case you would have infinite processing elements (PEs), so each PE would handle one item of your data and they would never need to communicate. This is almost never the case, so you need to partition your work.
Partition global work into smaller pieces
Each partition is called a work-group
Work-groups are scheduled to execute on compute units
Each work-group has shared memory (the local memory of the compute unit)
The work-items of a work-group are mapped directly to processing elements of the compute unit.
To summarize: work-groups are scheduled onto compute units, and work-items execute on processing elements.
Consider the case below where you have a vector of 24 elements and your work-group size is 8. OpenCL will automatically partition your data like this.
Each work-group will be mapped to a compute unit, if you have enough compute units.
This process is done automatically by OpenCL; you just need to give the global work size (24) and the work-group size (8). In an ideal case each PE then gets one item of your work-group.
Now, what happens if you don't have enough compute units (also a common case)? In that case OpenCL hands the work-groups to the available compute units one by one, in a serial fashion. The beauty of this is that if you just plug in better hardware (more compute units), performance scales.
Instead of dealing with 1D vectors, you can also work and think in 2D or 3D.
Another thing to point out is that work-items can only communicate when they are in the same work-group. Different work-groups cannot communicate. The only option would be to break the problem into two kernels and use global memory, but this would be too slow.
To launch a kernel you use the function clEnqueueNDRangeKernel, passing the parameters global_work_size and local_work_size. The global_work_size parameter defines the total number of work-items (threads) that will be launched, each one executing separately and getting one chunk of your input data. The local_work_size parameter defines the size of your work-group in work-items. Work-items within a work-group can share local variables.
The number of work-groups is obtained by dividing the global_work_size by the local_work_size.
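As a sketch, launching the 24-element example above with work-groups of 8 looks like this (assuming a command queue `queue` and a kernel `kernel` were created earlier; error checking omitted):

```c
size_t global_work_size = 24;  /* total number of work-items  */
size_t local_work_size  = 8;   /* work-items per work-group   */
/* number of work-groups = 24 / 8 = 3 */

clEnqueueNDRangeKernel(queue, kernel,
                       1,                  /* work_dim: a 1D problem           */
                       NULL,               /* global offset                    */
                       &global_work_size,
                       &local_work_size,   /* or NULL to let OpenCL choose     */
                       0, NULL, NULL);     /* no event wait list, no event out */
```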
Each work-group is executed on a compute unit, which is able to handle many work-items, not only one. So it's not a good idea to have several small work-groups. On the other hand, if a work-group has too many work-items you will lose parallelism; the best ratio is found by trial and error.
Work-items that run on the same compute unit can share variables, because they execute in the same work-group.
The work-group size can be chosen automatically (not always the best choice) if you pass NULL instead of local_work_size.
Some notes:
A Kernel is invoked once for each work item. Each work item has private memory.
Work items are grouped into a work group. Each work group shares local memory
The total number of all work items is specified by the global work size.
Global and constant memory are shared across all work-items of all work-groups.
OpenCL provides a set of functions to control the devices on your system. A device does not know what to do by itself; the Host API controls the whole system. Below are the major components of the Host API:
Platform
Context
Programs
Asynchronous Device calls
A platform is an OpenCL implementation. Think of it as a device driver that exposes the devices available on the heterogeneous system, for example a desktop computer with 2 GPUs, 1 FPGA card, and one big 32-core CPU. The Platform API discovers the devices available to you.
The context allows you to group multiple devices of a specific platform, for instance when you want to make the compute units of all devices available. A context holds devices and memory.
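A minimal sketch of discovering a platform, picking a GPU device, and grouping it into a context with a command queue (error checking omitted):

```c
cl_platform_id platform;
cl_device_id device;

clGetPlatformIDs(1, &platform, NULL);                            /* first platform        */
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);  /* first GPU device      */

cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);
```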
A program is just a collection of kernels that need to be compiled and/or loaded.
The Host API provides functions to issue commands to the devices (clEnqueue*). These functions are asynchronous, so the host does not freeze while a device is executing a command.
Most OpenCL host programs have the following structure.
In terms of source code.
This big chunk of code always follows the same structure:
1. Define (query) the platform and create a command queue
2. Define memory objects (clCreateBuffer)
3. Create the program (library of kernels)
4. Build the program
5. Set up and launch the kernel (clCreateKernel, clSetKernelArg, clEnqueueNDRangeKernel)
6. Get the results back to the host
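A condensed sketch of that structure for a single kernel named vec_add (the name, sizes, placeholder source string, and missing error checking are illustrative, not a complete program):

```c
#include <CL/cl.h>

int main(void) {
    const char *source = "...";   /* OpenCL C source containing the kernels */
    size_t n = 1024;              /* illustrative problem size              */

    /* 1. Define (query) the platform and create a command queue */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* 2. Define memory objects */
    cl_mem a = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, NULL);
    cl_mem b = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, NULL);
    cl_mem c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, NULL);
    /* (filling a and b with clEnqueueWriteBuffer is omitted here) */

    /* 3. Create the program (library of kernels) and 4. build it */
    cl_program program = clCreateProgramWithSource(ctx, 1, &source, NULL, NULL);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);

    /* 5. Set up and launch the kernel */
    cl_kernel kernel = clCreateKernel(program, "vec_add", NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &b);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &c);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* 6. Get the results back to the host */
    float result[1024];
    clEnqueueReadBuffer(queue, c, CL_TRUE, 0, n * sizeof(float), result, 0, NULL, NULL);

    return 0;
}
```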
In OpenCL the devices execute kernels, small functions written in OpenCL C, which is a subset of C (C99). Kernels are the entry points (like the main function) for device execution. Kernels are loaded and prepared by the host.
Here are the main differences between C and OpenCl C:
No function pointers
No recursion
Has vector types
Has image types
Structures are allowed, but they hurt performance and communicating them with the host can be complicated
There is no mechanism for different kernels to cooperate
Kernel arguments are either pointers to global memory or plain values passed in.
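For example (with hypothetical names), argument 0 below is a pointer into global memory (a buffer object) and argument 1 is a plain value copied to the device:

```c
float scale = 2.0f;
clSetKernelArg(kernel, 0, sizeof(cl_mem), &input_buffer);  /* __global float* on the device */
clSetKernelArg(kernel, 1, sizeof(float),  &scale);         /* plain value                   */
```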
The only point to pay attention to is that integer types are represented in two's complement, and this may differ on the host.
| Type | Size |
| --- | --- |
| char | 8-bit integer |
| short | 16-bit integer |
| int | 32-bit integer |
| long | 64-bit integer |
| float | 32-bit single precision |
| double | 64-bit double precision |
| Signed Type | Unsigned Type |
| --- | --- |
| charN | ucharN |
| shortN | ushortN |
| intN | uintN |
| longN | ulongN |
| floatN | - |
| doubleN | - |
Mixing scalars and vectors
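A small OpenCL C sketch of what this looks like; scalars are broadcast across the vector components:

```c
float4 v = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
float  s = 2.0f;

float4 doubled = v * s;       /* the scalar is broadcast to every component */
float4 shifted = v + 1.0f;    /* same here                                  */
float  first   = v.x;         /* components are accessed as .x .y .z .w     */
float2 low     = v.lo;        /* or as sub-vectors: .lo, .hi, .s01, ...     */
```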
The diagram below shows how to define the memory regions (local, private, constant, global).
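In OpenCL C these regions correspond to address space qualifiers; here is an illustrative kernel signature (names are hypothetical) in addition to the diagram:

```c
__kernel void filter(__global   float* data,     /* global: visible to every work-item       */
                     __constant float* params,   /* constant: read-only, e.g. filter weights */
                     __local    float* scratch)  /* local: shared inside one work-group      */
{
    float tmp = data[get_global_id(0)];          /* private: each work-item has its own copy */
    scratch[get_local_id(0)] = tmp * params[0];
    data[get_global_id(0)] = scratch[get_local_id(0)];
}
```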
Remember that the work-group partitioning is done automatically by OpenCL; we just choose the work-group size.
get_global_id(n): get the work-item's global ID in dimension n.
get_global_offset(n): get the global offset in dimension n.
get_local_id(n): which work-item am I inside my work-group, in dimension n.
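Inside a kernel those IDs relate to each other like this (1D case, assuming a global offset of 0):

```c
int gid   = get_global_id(0);    /* position in the whole problem  */
int lid   = get_local_id(0);     /* position inside my work-group  */
int group = get_group_id(0);     /* which work-group am I in       */
int lsize = get_local_size(0);   /* work-items per work-group      */

/* gid == group * lsize + lid   (plus get_global_offset(0) if non-zero) */
```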
For instance, imagine a function that needs to calculate the addition of two 1D vectors of 10,000,000 (10 million) elements. In normal sequential programming this is just a loop over all elements.
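A minimal sketch of that sequential version in plain C:

```c
/* Sequential host version: one loop iteration per element */
void vector_add(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```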
With OpenCL kernels we have the option to execute this function on multiple processing elements. Ideally you would have one processing element for each vector element; in that case the whole operation would take a single parallel step. In reality you frequently have fewer processing elements than elements to work on. Observe that we want to replace the for-loop with the parallel execution of its iterations.
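A sketch of the same operation as an OpenCL C kernel: the loop disappears and each work-item handles one element (assuming the global work size equals the vector length):

```c
__kernel void vec_add(__global const float* a,
                      __global const float* b,
                      __global       float* c)
{
    int i = get_global_id(0);   /* this work-item's element index */
    c[i] = a[i] + b[i];
}
```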
As mentioned before, there is no way to synchronize work-items (threads) in different work-groups (different work-groups may be executed on different compute units). But inside the same work-group OpenCL gives two options:
mem-fences
barriers
Both take CLK_LOCAL_MEM_FENCE and/or CLK_GLOBAL_MEM_FENCE as a parameter. Using them has some impact on performance, but sometimes you really need them.
mem_fence: waits until all reads/writes to local or global memory made by the calling work-item prior to mem_fence() are visible to all work-items (threads) in the work-group. Basically it enforces that changes made by the current work-item become visible to the whole work-group.
barrier: waits until all work-items in the work-group have reached this point, and performs a mem_fence to ensure that all work-items (threads) see the same data.
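An illustrative kernel where the work-items of one work-group stage data in local memory and must synchronize with barrier() before reading each other's values (here each work-group reverses its own chunk):

```c
__kernel void reverse_in_group(__global float* data, __local float* scratch)
{
    int gid   = get_global_id(0);
    int lid   = get_local_id(0);
    int lsize = get_local_size(0);

    scratch[lid] = data[gid];              /* every work-item writes its element          */
    barrier(CLK_LOCAL_MEM_FENCE);          /* wait until the whole work-group has written */
    data[gid] = scratch[lsize - 1 - lid];  /* now it is safe to read a neighbour's value  */
}
```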
| CUDA | OpenCL |
| --- | --- |
| Streaming Multiprocessor (SM) | Compute Unit |
| Streaming Processor (SP) | Processing Element |
| Global Memory | Global Memory |
| Shared Memory | Local Memory |
| Local Memory | Private Memory |
| Kernel | Kernel |
| Warp | Wavefront |
| Thread | Work-item |
| Block | Work-group |
The maximum work-group size is device-specific.
OpenCL supports vector types with 2, 3, 4, 8, or 16 components. This allows OpenCL to use the vectorized instructions of the device, for instance ARM NEON instructions.