Juno Platform
Last updated
This is the development platform that we use in this book.
Some features of Juno:
2 fast Cortex-A57 cores (1.1 GHz)
4 Cortex-A53 cores (850 MHz)
4 Mali shader cores (500 MHz)
8 GB DDR3 RAM
1 Gbit Ethernet connection
In our case we're interested in the Mali-T624.
It is a GPU with 4 cores (compute units) running at 500 MHz.
The Mali-T600, -T700 and -T800 series of GPUs are not wavefront based. With each thread having its own program counter, threads are entirely independent of each other so the above technique runs fine. In other words, we really have 4 independent cores.
On the Juno platform the GPU shares its memory with the CPU (8 GB DDR3), so memory transfers are faster: in theory a transfer costs about as much as a memcpy plus the time to invalidate the CPU cache.
Basically you need two things:
Have the Mali OpenCL SDK installed
Have a rootfs with the proper Mali drivers. If not, get the driver here
If we compile and execute our queryHost example on the Juno platform with the Mali drivers installed, this is what we get.
Every Mali-T600 thread has its own independent program counter (a warp size of 1)
OpenCL barrier operations (which synchronise threads) are handled by the hardware
For full efficiency you need more work-groups than cores
When running on Mali just use global memory
Mali prefers explicit vector functions
All CL memory buffers reside in global memory that is accessible by both CPU and GPU cores.
When we use a barrier, the thread enters the texturing pipeline and takes many more cycles to complete. That's why it's preferable to use atomic functions for synchronization.
Each ALU can perform 17 floating-point operations per cycle, and we have one ALU per core.
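As a rough sanity check, the figures quoted in this section can be multiplied out into a theoretical peak. The sketch below is only that arithmetic: `peak_gflops` is a hypothetical helper (not part of any SDK), and the inputs (17 operations per cycle, one ALU per core, 4 cores, 500 MHz) are taken from the text rather than measured.

```c
#include <assert.h>

/* Theoretical peak in GFLOPS:
 *   flops per cycle per ALU * ALUs per core * cores * clock (MHz) / 1000.
 * peak_gflops is a hypothetical helper used for illustration only. */
double peak_gflops(int flops_per_cycle, int alus_per_core,
                   int cores, int clock_mhz)
{
    assert(flops_per_cycle > 0 && alus_per_core > 0);
    assert(cores > 0 && clock_mhz > 0);

    /* The product is in MFLOPS; divide by 1000 to get GFLOPS. */
    double mflops = (double)flops_per_cycle * alus_per_core * cores * clock_mhz;
    return mflops / 1000.0;
}
```

With the numbers above, `peak_gflops(17, 1, 4, 500)` evaluates to 34.0. Real kernels will land well below this upper bound, since it ignores memory traffic and assumes every ALU slot is filled on every cycle.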
It's the first thing to try to improve performance.
The Mali-T624 can provide up to 256 "threads" or work-items divided among the 4 cores.
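The earlier tip about needing more work-groups than cores can be checked before launching a kernel. The following is a minimal sketch, assuming a 1-D NDRange; the helper names `num_work_groups` and `has_more_groups_than_cores` are hypothetical, and the core count of 4 comes from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups created for a 1-D NDRange.
 * OpenCL 1.x requires global_size to be a multiple of local_size. */
size_t num_work_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Returns 1 when there are strictly more work-groups than shader cores,
 * 0 otherwise. Hypothetical helper, for illustration only. */
int has_more_groups_than_cores(size_t global_size, size_t local_size,
                               size_t cores)
{
    return num_work_groups(global_size, local_size) > cores;
}
```

For example, 256 work-items with a local size of 64 give exactly 4 work-groups, which is not more than the 4 cores, so `has_more_groups_than_cores(256, 64, 4)` is 0. Either a larger global size or a smaller local size would satisfy the tip.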
On the current Mali architecture the cores do not have a local cached memory, so there is no real advantage in using local-memory optimizations.
Since ARM recommends avoiding barriers on Mali GPUs, we should write kernels without this type of synchronization, because it hurts performance. Try to use atomic functions instead.
Observe that even though Juno has multiple ARM cores, they are not available to the OpenCL platform. In this case we still need to cross-compile an OpenCL driver for the ARM cores using the POCL project. You also need an OpenCL Installable Client Driver (ICD) loader, so that different OpenCL implementations can coexist on the same system. Some instructions to build the OpenCL ICD can be found here.
So basically we now launch 4x fewer threads (globalsize/4), with fewer global memory accesses and more operations per thread. (This actually hurts performance on NVIDIA.)
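The change in launch shape can be illustrated with a CPU emulation in plain C. This is only a sketch: a vector add stands in for whatever kernel is being tuned, each loop iteration plays the role of one work-item, and the function names `add_scalar` and `add_vec4` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Each outer loop iteration stands in for one work-item;
 * gid plays the role of get_global_id(0). */

/* Scalar shape: n work-items, one element each.
 * Returns the number of work-items "launched". */
size_t add_scalar(const float *a, const float *b, float *out, size_t n)
{
    for (size_t gid = 0; gid < n; ++gid)
        out[gid] = a[gid] + b[gid];
    return n;
}

/* Vector shape: n/4 work-items, four elements each (n must be a multiple
 * of 4). On the device the inner loop would collapse into one float4
 * operation, for example:
 *   vstore4(vload4(gid, a) + vload4(gid, b), gid, out);               */
size_t add_vec4(const float *a, const float *b, float *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t gid = 0; gid < n / 4; ++gid)
        for (size_t lane = 0; lane < 4; ++lane)
            out[4 * gid + lane] = a[4 * gid + lane] + b[4 * gid + lane];
    return n / 4;
}
```

Both functions produce the same output, but the second uses a quarter of the work-items, and on the GPU each of them would issue one wide load per input instead of four narrow ones. That matches the explicit vector style Mali prefers.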