Deep Learning Speedup

In this chapter we're going to see how we can accelerate some deep learning operations using the Mali GPU on the Juno platform (2x 1.1 GHz CPU cores). If you are not familiar with deep learning concepts you may refer to another book here.

Use case: Residual network

Our problem is to improve the performance of the residual network, where the input is a 120x120x3 RGB image. Note that in our case we don't use batch-norm blocks. The network performs 52 convolutions, max-pooling, and 1 inner-product operation.

Time distribution

Normally in deep learning models, most of the time is spent on convolutions, so this is the first target that we want to accelerate.

We started by generating C code from MATLAB with no compiler optimization (-O0); the forward propagation takes 16 seconds to compute. Using compiler optimization (-O3) and asking MATLAB to prioritize code efficiency, this time goes down to 4 seconds, which is nice given the zero effort, but still too slow.

No profiling and no optimization

# time ./resnet500    
** created resnet500.mat **
real    0m15.894s
user    0m15.660s
sys    0m0.230s

Now using optimization (-O3)

# time ./resnet500_O3 
** created resnet500.mat **
real    0m3.993s
user    0m3.710s
sys    0m0.280s

Full debug and profiling

# time ./resnet500_O0_Profiling 
** created resnet500.mat **
real    0m17.551s
user    0m17.270s
sys    0m0.270s

Running gprof

Now let's take a closer look at the profile (gprof). The first thing we can observe is that the convolutions take roughly 90% of the computational time (summing the self time of the convolution functions accounts for about 15.5 of the ~17 seconds). The function resnet500_step calls all the other functions, but its own time is mostly spent loading the model parameters. Our worst case was the function resnet500_forward_conv_gp.

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 19.17      3.29     3.29        6     0.55     0.55  resnet500_forward_conv_gp
 11.89      5.33     2.04        4     0.51     0.52  resnet500_forward_conv_h
 10.14      7.07     1.74        3     0.58     0.58  resnet500_forward_conv_k
  9.03      8.62     1.55        3     0.52     0.54  resnet500_forward_conv
  7.46      9.90     1.28        6     0.21     0.22  resnet500_forwardConvolution_a
  7.11     11.12     1.22        1     1.22    17.16  resnet500_step
  6.12     12.17     1.05        5     0.21     0.21  resnet500_forward_conv_d
  5.01     13.03     0.86        4     0.22     0.23  resnet500_forwardConvolution
  4.55     13.81     0.78        4     0.20     0.20  resnet500_forwardConvolution_g
  3.67     14.44     0.63        3     0.21     0.21  resnet500_forwardConvolution_gk
  3.55     15.05     0.61        1     0.61     0.65  resnet500_forward_conv_h0
  3.26     15.61     0.56        3     0.19     0.19  resnet500_forward_conv_l
  2.56     16.05     0.44        2     0.22     0.22  resnet500_forward_conv_lk
  2.16     16.42     0.37        2     0.19     0.20  resnet500_forward_conv_g
  0.64     16.53     0.11        1     0.11     0.11  resnet500_forward_conv_fq
  0.58     16.63     0.10        1     0.10     0.10  resnet500_forward_conv_ki
  0.58     16.73     0.10        1     0.10     0.10  resnet500_forward_conv_f
  0.29     16.78     0.05        3     0.02     0.02  resnet500_im2col_ref_p
  0.29     16.83     0.05        1     0.05     0.05  resnet500_forward_conv_j

Locating the function in the model

Now let's find where this function is located in our model. In our case the model is in Simulink, so it is easy to locate because the generated source code points back to the original model.

Checking with Valgrind

Here we run Valgrind (callgrind) on the host machine. It shows the same hotspots as the version running on the Juno board, with resnet500_forward_conv_gp again at the top.

In the source code we can see that the matrix-multiplication part is indeed taking most of the execution time.

First matrix multiplication

In our network the first matrix multiplication is A[64x147] * B[147x3600]. We start by profiling a naive implementation running on the Juno CPU, sketched below.
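
The naive implementation is just a triple loop over output rows, output columns and the inner dimension. A minimal sketch follows; the function name and signature match the symbol in the profiler output, but the body itself is our assumption, not the original generated code.

/* Naive row-major matrix multiplication: C[M x N] = A[M x K] * B[K x N].
   The name and signature match the gprof symbol below; the body is a sketch
   of a straightforward triple-loop implementation, not the original source. */
void matrix_2d_mul_float(float *A, float *B, float *C,
                         int num_rows_A, int num_cols_A, int num_cols_B)
{
    for (int i = 0; i < num_rows_A; i++) {
        for (int j = 0; j < num_cols_B; j++) {
            float sum = 0.0f;
            for (int k = 0; k < num_cols_A; k++) {
                /* A indexed by [row][k], B indexed by [k][col] */
                sum += A[i * num_cols_A + k] * B[k * num_cols_B + j];
            }
            C[i * num_cols_B + j] = sum;
        }
    }
}

Profiling this naive version with the sizes above gives: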

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 99.16      0.96     0.96        1   961.87   961.87  matrix_2d_mul_float(float*, float*, float*, int, int, int)
  1.03      0.97     0.01                             main
  0.00      0.97     0.00        2     0.00     0.00  fillRand(float*, int, int, int)
  0.00      0.97     0.00        1     0.00     0.00  _GLOBAL__sub_I_num_rows_A
  0.00      0.97     0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)

This means we spend about 962 milliseconds on this single multiplication. If we measure the time of our naive OpenCL implementation instead, it goes down to 41.4 milliseconds, roughly 23x faster.
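
The naive OpenCL version assigns one work-item to each element of the output matrix, so the 2D global work size equals the dimensions of C (this matches the Global size printed in the log of the next section). A minimal kernel sketch, assuming row-major storage; the kernel name and argument order are our assumptions, not the exact code used here:

// Naive OpenCL kernel: each work-item computes one element of C.
// C[M x N] = A[M x K] * B[K x N], all matrices row-major.
__kernel void matrix_2d_mul_float_gpu(__global const float *A,
                                      __global const float *B,
                                      __global float *C,
                                      const int K,
                                      const int N)
{
    const int row = get_global_id(0);   // row of C, 0..M-1
    const int col = get_global_id(1);   // column of C, 0..N-1

    float sum = 0.0f;
    for (int k = 0; k < K; k++) {
        sum += A[row * K + k] * B[k * N + col];
    }
    C[row * N + col] = sum;
}

Each work-item reads an entire row of A and an entire column of B from global memory, so there is still plenty of room for optimization (tiling, local memory), but even this untiled version is enough for the 23x speedup above.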

Other matrix multiplication

From the profiler output we detected that the worst-case convolution was resnet500_forward_conv_gp. It involves the im2col operation and a matrix multiplication A[256x2304] * B[2304x64].
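
As a reminder, im2col unrolls every receptive field of the input into a column of a large matrix, so the whole convolution reduces to a single matrix multiplication with the filter weights; here 2304 = 256 x 3 x 3 would fit 256 input channels with a 3x3 kernel, although the exact layer parameters are an assumption on our part. Below is a simplified sketch for the stride-1, no-padding case; it is an illustration only, not the generated resnet500_im2col_ref_p code.

/* Simplified im2col: input is channels x height x width (row-major),
   square ksize x ksize kernel, stride 1, no padding. Each column of the
   output matrix holds one unrolled receptive field, so the convolution
   becomes a (filters x C*K*K) * (C*K*K x out_h*out_w) multiplication. */
void im2col_simple(const float *in, int channels, int height, int width,
                   int ksize, float *out)
{
    int out_h = height - ksize + 1;
    int out_w = width  - ksize + 1;

    for (int c = 0; c < channels; c++) {
        for (int kh = 0; kh < ksize; kh++) {
            for (int kw = 0; kw < ksize; kw++) {
                int row = (c * ksize + kh) * ksize + kw;  /* row in unrolled matrix */
                for (int y = 0; y < out_h; y++) {
                    for (int x = 0; x < out_w; x++) {
                        int col = y * out_w + x;          /* output spatial position */
                        out[row * (out_h * out_w) + col] =
                            in[(c * height + (y + kh)) * width + (x + kw)];
                    }
                }
            }
        }
    }
}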

The matrix multiplication alone takes about 46 ms on the GPU, which is already 12x faster than the pure CPU version.

Multiplying 2 matrices A[256,2304] * B[2304,64]
Size in bytes A: 2359296
Size in bytes B: 589824
Size in bytes C: 65536
Initializing OpenCL device...
Compiling OpenCL kernel...
Global size[256, 64]
Matrix multiplication done 0
Matrix multiplication done 1
Matrix multiplication done 2
Matrix multiplication done 3
Matrix multiplication done 4
Matrix multiplication done 5
Matrix multiplication done 6
Matrix multiplication done 7
Matrix multiplication done 8
Matrix multiplication done 9

Kernel Execution time = 45.749 ms
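
The kernel execution time reported above is the pure on-device time. A typical way to obtain it is through OpenCL event profiling, as in the sketch below; we assume this is how the measurement is done here, the helper name is ours, and the command queue must be created with CL_QUEUE_PROFILING_ENABLE.

#include <CL/cl.h>

/* Returns the kernel execution time in milliseconds using OpenCL event
   profiling. Assumes kernel arguments are already set and that `queue`
   was created with CL_QUEUE_PROFILING_ENABLE. Error handling omitted. */
static double run_and_time_kernel(cl_command_queue queue, cl_kernel kernel,
                                  const size_t global_size[2])
{
    cl_event evt;
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, NULL,
                           0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong t_start = 0, t_end = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(t_start), &t_start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(t_end), &t_end, NULL);
    clReleaseEvent(evt);

    /* Timestamps are reported in nanoseconds. */
    return (double)(t_end - t_start) * 1e-6;
}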

What to do now

We now know that our hotspot is on the convolution side, so we're going to create a simpler model containing only the worst-case convolution (resnet500_forward_conv_gp). From there we will look at where to improve.
