Juno Platform

Introduction

This is the development platform that we use in this book.

Some features of Juno

2 fast Cortex-A57 cores (1.1 GHz)
4 Cortex-A53 cores (850 MHz)
4 Mali shader cores (500 MHz)
8 GB DDR3 RAM
1 Gbit Ethernet connection

In our case we're interested in the Mali-T624.

It is a GPU composed of 4 cores (compute units) running at 500 MHz.

No wavefronts (so no divergence)

The Mali-T600, -T700 and -T800 series of GPUs are not wavefront based. Each thread has its own program counter, so threads are entirely independent of each other and divergent branches run fine. In other words, we really have 4 independent cores.

Memory consideration

On the Juno platform the GPU shares its memory with the CPU (8 GB DDR3), so memory transfers are faster: in theory a transfer costs about as much as a memcpy plus the time to invalidate the CPU cache.

Steps to use OpenCL on Mali

Basically you need two things:

Have the Mali OpenCL SDK installed
Have a rootfs with the proper Mali drivers (if not, get the driver here)

Device Info on Juno (Mali-T624)

If we compile and execute our queryHost example on the Juno platform with the Mali drivers installed, this is what we get:

root@genericarmv8:~/work# ./queryHost

Number of OpenCL platforms found: 1
CL_PLATFORM_PROFILE: FULL_PROFILE
CL_PLATFORM_VERSION: OpenCL 1.2 v1.r10p0-00rel0.83e65da3dbe0d5979ba9881967b24b6f
CL_PLATFORM_NAME: ARM Platform
CL_PLATFORM_VENDOR: ARM
CL_PLATFORM_EXTENSIONS: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_gl_sharing cl_khr_icd cl_khr_egl_event cl_khr_egl_image cl_arm_core_id cl_arm_printf cl_arm_thread_limit_hint cl_arm_non_uniform_work_group_size cl_arm_import_memory
Number of detected OpenCL devices: 1
GPU detected
    Device name is Mali-T624
    Device vendor is ARM
    VENDOR ID: 0x6200010
    Device max memory allocation: 2009 mega-bytes
    Device global cacheline size: 64 bytes
    Device global mem: 8038 mega-bytes
    Maximum number of parallel compute units: 4
    Maximum dimensions for global/local work-item IDs: 3
    Maximum number of work-items in each dimension: ( 256 256 256  )
    Maximum number of work-items in a work-group: 256
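
The query reports a maximum single allocation of 2009 MB against 8038 MB of global memory, roughly one quarter. The following is a minimal sketch of checking a buffer request against that limit before creating it. The function name is hypothetical, and reading "mega-bytes" as MiB (2**20 bytes) is an assumption; the numbers are copied from the output above.

```python
# Limits copied from the queryHost output above. Treating "mega-bytes"
# as MiB (2**20 bytes) is an assumption.
MAX_ALLOC_MB = 2009
GLOBAL_MEM_MB = 8038
MIB = 1024 * 1024


def fits_in_one_buffer(num_elements, bytes_per_element, max_alloc_mb=MAX_ALLOC_MB):
    """Return True if a single buffer of this size is within the max allocation."""
    return num_elements * bytes_per_element <= max_alloc_mb * MIB


# 256M floats (4 bytes each) need 1024 MiB: fits in one buffer.
print(fits_in_one_buffer(256 * 1024 * 1024, 4))   # True
# 1G floats need 4096 MiB: less than global memory, but above the per-buffer limit.
print(fits_in_one_buffer(1024 * 1024 * 1024, 4))  # False
```

A data set larger than the per-buffer limit has to be split across several buffers even though it would fit in global memory as a whole.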

Mali OpenCL execution model

Every Mali-T600 thread has its own independent program counter (a warp size of 1)
OpenCL barrier operations (which synchronise threads) are handled by the hardware
For full efficiency you need more work-groups than cores
When running on Mali just use global memory
Mali prefers explicit vector functions
All CL memory buffers reside in global memory that is accessible by both CPU and GPU cores
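
To make the "more work-groups than cores" rule concrete, here is a minimal sketch that counts the work-groups produced by a given global and local size and compares that count with the number of compute units. The function names are hypothetical; the constants come from the device query above.

```python
COMPUTE_UNITS = 4           # "Maximum number of parallel compute units"
MAX_WORK_GROUP_SIZE = 256   # "Maximum number of work-items in a work-group"


def work_group_count(global_size, local_size):
    """Number of work-groups an NDRange of global_size splits into."""
    if local_size < 1 or local_size > MAX_WORK_GROUP_SIZE:
        raise ValueError("local_size must be between 1 and %d" % MAX_WORK_GROUP_SIZE)
    # Integer ceiling division.
    return (global_size + local_size - 1) // local_size


def more_groups_than_cores(global_size, local_size, compute_units=COMPUTE_UNITS):
    """True when there are more work-groups than cores."""
    return work_group_count(global_size, local_size) > compute_units


print(work_group_count(1024, 256), more_groups_than_cores(1024, 256))    # 4 False
print(work_group_count(65536, 128), more_groups_than_cores(65536, 128))  # 512 True
```

In the first case there is exactly one work-group per core, so a core that finishes early has nothing else to pick up; the second case leaves plenty of work-groups to schedule.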

Inside a Mali Core

When we use a barrier, the thread enters the texturing pipeline and takes many more cycles to complete. That's why it's preferable to use atomic functions for synchronization.

ALU

Each ALU can perform 17 floating-point operations per cycle, and we have one ALU per core.
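
As a rough worked example, the theoretical peak follows from multiplying these figures together. This sketch only uses the numbers stated in this chapter (17 operations per cycle, one ALU per core, 4 cores, 500 MHz); the function name is hypothetical, and real kernels will land well below this bound.

```python
def peak_gflops(flops_per_cycle, alus_per_core, cores, clock_mhz):
    """Theoretical peak in GFLOPS: flops/cycle * ALUs per core * cores * clock."""
    return flops_per_cycle * alus_per_core * cores * clock_mhz / 1000.0


# Figures as stated in this chapter for the Mali-T624 on Juno.
print(peak_gflops(17, 1, 4, 500))  # 34.0
```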

Use vector operations

It's the first way to improve performance.

With 4-wide vectors we launch 4x fewer threads (globalsize/4), each making fewer global memory accesses and doing more operations. (This actually hurts performance on NVIDIA.)
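
The effect on the launch size can be illustrated without a GPU. The sketch below emulates in plain Python a scalar kernel (one work-item per element) and a 4-wide kernel (one work-item per four elements). It is not OpenCL code: the function names are hypothetical and the loop variable gid only stands in for get_global_id(0).

```python
def add_scalar(a, b):
    """One simulated work-item per element: global size is len(a)."""
    out = [0.0] * len(a)
    for gid in range(len(a)):      # gid plays the role of get_global_id(0)
        out[gid] = a[gid] + b[gid]
    return out, len(a)


def add_vec4(a, b):
    """One simulated work-item per 4 elements: global size is len(a) // 4."""
    assert len(a) % 4 == 0, "this sketch assumes a length divisible by 4"
    out = [0.0] * len(a)
    global_size = len(a) // 4
    for gid in range(global_size):
        base = 4 * gid
        # One wide load per input, four additions, one wide store.
        va = a[base:base + 4]
        vb = b[base:base + 4]
        out[base:base + 4] = [x + y for x, y in zip(va, vb)]
    return out, global_size


a = [float(i) for i in range(16)]
b = [1.0] * 16
scalar_out, scalar_threads = add_scalar(a, b)
vec_out, vec_threads = add_vec4(a, b)
print(scalar_out == vec_out, scalar_threads, vec_threads)  # True 16 4
```

Both versions produce the same result, but the vector version needs a quarter of the work-items, each doing four times the arithmetic per memory access.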

Max number of work-items

The Mali-T624 reports a maximum work-group size of 256 work-items ("threads"), and work-groups are distributed across the 4 cores.

__local memory

On the current Mali architecture the cores do not have a dedicated local memory, so there is no real advantage in using local memory optimizations.

Synchronization

ARM recommends avoiding barriers on Mali GPUs, so we should write kernels without this type of synchronization because it hurts performance. Try to use atomic functions instead.

Bridge different devices

Observe that even though Juno has multiple ARM cores, they are not available to the OpenCL platform. In this case we still need to cross-compile an OpenCL driver for the ARM cores using the POCL project. You then also need an OpenCL Installable Client Driver (ICD) loader to bridge the different devices on the same system. Some instructions to build the OpenCL ICD can be found here.
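
For orientation, the Khronos ICD loader on Linux conventionally discovers implementations through small text files in /etc/OpenCL/vendors, each ending in .icd and containing the name or path of one vendor library. The sketch below lists what such a directory registers. It assumes that convention applies to your rootfs; the function name and the file and library names in the demo are hypothetical.

```python
import os
import tempfile


def list_icd_libraries(vendors_dir="/etc/OpenCL/vendors"):
    """Return {icd_file_name: library} for every *.icd file in vendors_dir."""
    found = {}
    if not os.path.isdir(vendors_dir):
        return found
    for name in sorted(os.listdir(vendors_dir)):
        if name.endswith(".icd"):
            with open(os.path.join(vendors_dir, name)) as handle:
                found[name] = handle.read().strip()
    return found


# Demo on a temporary directory; file and library names are hypothetical.
with tempfile.TemporaryDirectory() as demo_dir:
    for icd_name, library in [("mali.icd", "libmali.so"), ("pocl.icd", "libpocl.so")]:
        with open(os.path.join(demo_dir, icd_name), "w") as handle:
            handle.write(library + "\n")
    demo = list_icd_libraries(demo_dir)
    print(demo)
```

With one entry for the Mali driver and one for POCL, a platform query would be expected to report two platforms instead of the single ARM platform shown earlier.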
