Deep Learning SpeedUp
In this chapter we're going to see how we can accelerate some deep learning operations using the Mali GPU on the Juno platform (2x 1.1 GHz CPU cores). If you are not familiar with deep learning concepts, you may refer to another book here.
Our problem is to improve the performance of a Residual Network where the input is a 120x120x3 RGB image. Note that in our case we don't use batch-norm blocks, and we do 52 convolutions, one max-pooling, and one inner-product operation.
Normally in deep learning models most of the time is spent on convolutions, so this is the first target that we want to accelerate.
Originally, generating C code from MATLAB with no compiler optimization (-O0), the forward propagation takes 16 seconds to compute. Using compiler optimization (-O3) and asking MATLAB to prioritize code efficiency, this time goes down to 4 seconds, which is nice for zero effort but still too slow.
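To get these numbers we just time a single forward pass around the generated entry point. Below is a minimal sketch of such a harness; the resnet500_step name and signature are assumptions based on the symbols that show up in the profile later:

```c
#include <stdio.h>
#include <time.h>

void resnet500_step(void); /* hypothetical MATLAB-generated entry point */

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    resnet500_step();               /* one forward propagation */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("forward pass: %.2f ms\n", ms);
    return 0;
}
```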
(Figures: build configurations compared — no profiling and no optimization; not using optimization; full debug and profiling.)
Now let's take a closer look at the profile (gprof). The first thing we can observe is that convolutions take 90% of the computation time. The function resnet500step calls all the other functions, but it spends most of its own time loading the model parameters. Our worst case was the function _resnet500_forward_conv_gp.
Now let's locate where this function lives in our model. In our case the model is in Simulink, so it is easy to locate, since the generated source code points back to the original model.
Here we run valgrind (callgrind) on the host machine. It shows the same hotspots as the version running on the Juno board: resnet500_forward_conv_gp.
In the source code we can see that the matrix multiplication part is indeed taking most of the execution time.
In our network the first matrix multiplication is A[64x147] times B[147x3600]. Here is the profiler output for a naive implementation running on the Juno (CPU).
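As a reference, this is roughly what such a naive implementation looks like; a minimal sketch with illustrative names (the generated code is structured differently but performs the same triple loop):

```c
/* Naive matrix multiplication C = A * B, with A[MxK] and B[KxN].
 * For the first layer: M = 64, K = 147, N = 3600,
 * e.g. matmul_naive(A, B, C, 64, 147, 3600); */
void matmul_naive(const float *A, const float *B, float *C,
                  int M, int K, int N) {
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
    }
}
```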
This means that we spend 961.86 milliseconds. If we measure the time of our naive OpenCL implementation, this goes down to 41.4 milliseconds, 23x faster.
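A naive OpenCL version can look like the sketch below (our illustration, not necessarily the exact kernel used in the measurement): one work-item computes one output element, launched as a 2D NDRange of N by M work-items.

```c
/* Naive OpenCL kernel: straightforward port of the triple loop above,
 * with the two outer loops replaced by the NDRange. */
__kernel void matmul_naive(__global const float *A,
                           __global const float *B,
                           __global float       *C,
                           const int M, const int K, const int N) {
    int j = get_global_id(0);  /* column of C */
    int i = get_global_id(1);  /* row of C    */
    if (i >= M || j >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < K; k++)
        acc += A[i * K + k] * B[k * N + j];
    C[i * N + j] = acc;
}
```

Even this unoptimized kernel beats the CPU simply because the Mali GPU runs many work-items in parallel; memory access patterns are left for later tuning.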
From the profiler output we see that the worst case among the convolutions was resnet500_forward_conv_gp. It involves an im2col operation and a matrix multiplication of A[256x2304] with B[2304x64].
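im2col unrolls every convolution window into a column so that the whole convolution becomes a single matrix multiplication; the 2304 above is consistent with, for example, 256 input channels and a 3x3 kernel (256 * 3 * 3 = 2304), though that layer layout is our assumption. A minimal sketch, assuming stride 1 and no padding for brevity:

```c
/* im2col (sketch): copies each ksize x ksize window of every channel into
 * a matrix with channels*ksize*ksize rows and out_h*out_w columns, so the
 * convolution reduces to one matrix multiplication with the filters. */
void im2col(const float *img, int channels, int height, int width,
            int ksize, float *col) {
    int out_h = height - ksize + 1;
    int out_w = width  - ksize + 1;
    for (int c = 0; c < channels; c++)
        for (int kh = 0; kh < ksize; kh++)
            for (int kw = 0; kw < ksize; kw++) {
                int row = (c * ksize + kh) * ksize + kw;
                for (int y = 0; y < out_h; y++)
                    for (int x = 0; x < out_w; x++)
                        col[(row * out_h + y) * out_w + x] =
                            img[(c * height + y + kh) * width + (x + kw)];
            }
}
```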
Just the matrix multiplication takes 46 ms, which is already 12x faster than the pure-CPU version.
So we know that our hotspot is on the convolution side. We're therefore going to create a simpler model containing only the worst-case convolution (resnet500_forward_conv_gp), and from there we will look at where to improve, as sketched below.
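A stand-alone benchmark for this hotspot could look like the following sketch, reusing the matmul_naive routine from above; the sizes come from the profile, while everything else (random data, the 256-channels-times-3x3 reading of 2304) is illustrative:

```c
#include <stdlib.h>

void matmul_naive(const float *A, const float *B, float *C,
                  int M, int K, int N); /* sketch shown earlier */

#define M 256
#define K 2304  /* assumed: 256 input channels * 3 * 3 kernel */
#define N 64

int main(void) {
    /* Random buffers: only the timing matters here, not the values. */
    float *A = malloc(sizeof(float) * M * K);
    float *B = malloc(sizeof(float) * K * N);
    float *C = malloc(sizeof(float) * M * N);
    for (int i = 0; i < M * K; i++) A[i] = (float)rand() / RAND_MAX;
    for (int i = 0; i < K * N; i++) B[i] = (float)rand() / RAND_MAX;

    matmul_naive(A, B, C, M, K, N); /* isolated worst-case hotspot */

    free(A); free(B); free(C);
    return 0;
}
```

Isolating the hotspot this way lets us iterate on the kernel quickly without running the whole network.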