# Deep Learning SpeedUp

In this chapter we'll see how to accelerate some deep learning operations using the Mali GPU on the Juno platform (2x 1.1 GHz CPU cores). If you are not familiar with deep learning concepts, you can refer to another book [here](https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/).

## Use case: Residual network

Our goal is to improve the performance of a residual network whose input is a 120x120x3 RGB image. Note that in our case we don't use batch norm blocks. The network performs 52 convolutions, max-pooling, and 1 inner product operation.

![](/files/-LvMRpRPSIxDaWzdd4PX)

### Time distribution

In deep learning models, most of the time is normally spent on convolutions, so they are the first thing we want to accelerate.

![](/files/-LvMRpRRWpqzFBR3SDmz)

Originally, we generated C code from MATLAB with no compiler optimization (-O0), and the forward propagation took about 16 seconds. Using compiler optimization (-O3) and asking MATLAB to prioritize code efficiency brings this down to about 4 seconds. That is a nice gain for zero effort, but it is still too slow.

![](/files/-LvMRpRTh35e02J3_86P)

No profiling and no optimization (-O0)

```bash
# time ./resnet500    
** created resnet500.mat **
real    0m15.894s
user    0m15.660s
sys    0m0.230s
```

Now using optimization (-O3)

```bash
# time ./resnet500_O3 
** created resnet500.mat **
real    0m3.993s
user    0m3.710s
sys    0m0.280s
```

Full debug and profiling

```bash
# time ./resnet500_O0_Profiling 
** created resnet500.mat **
real    0m17.551s
user    0m17.270s
sys    0m0.270s
```

### Running gprof

Now let's take a closer look at the profile (gprof). The first thing we can observe is that convolutions take about 90% of the computation time. The function **resnet500\_step** calls all the other functions, but it spends most of its own time loading the model parameters. Our worst case was the function **resnet500\_forward\_conv\_gp**.

```bash
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 19.17      3.29     3.29        6     0.55     0.55  resnet500_forward_conv_gp
 11.89      5.33     2.04        4     0.51     0.52  resnet500_forward_conv_h
 10.14      7.07     1.74        3     0.58     0.58  resnet500_forward_conv_k
  9.03      8.62     1.55        3     0.52     0.54  resnet500_forward_conv
  7.46      9.90     1.28        6     0.21     0.22  resnet500_forwardConvolution_a
  7.11     11.12     1.22        1     1.22    17.16  resnet500_step
  6.12     12.17     1.05        5     0.21     0.21  resnet500_forward_conv_d
  5.01     13.03     0.86        4     0.22     0.23  resnet500_forwardConvolution
  4.55     13.81     0.78        4     0.20     0.20  resnet500_forwardConvolution_g
  3.67     14.44     0.63        3     0.21     0.21  resnet500_forwardConvolution_gk
  3.55     15.05     0.61        1     0.61     0.65  resnet500_forward_conv_h0
  3.26     15.61     0.56        3     0.19     0.19  resnet500_forward_conv_l
  2.56     16.05     0.44        2     0.22     0.22  resnet500_forward_conv_lk
  2.16     16.42     0.37        2     0.19     0.20  resnet500_forward_conv_g
  0.64     16.53     0.11        1     0.11     0.11  resnet500_forward_conv_fq
  0.58     16.63     0.10        1     0.10     0.10  resnet500_forward_conv_ki
  0.58     16.73     0.10        1     0.10     0.10  resnet500_forward_conv_f
  0.29     16.78     0.05        3     0.02     0.02  resnet500_im2col_ref_p
  0.29     16.83     0.05        1     0.05     0.05  resnet500_forward_conv_j
```

#### Locating the function in the model

Now let's find where this function sits in our model. In our case the model is in Simulink, so this is easy, since the generated source code points back to the original model.

![](/files/-LvMRpRV-gFNIpA44yBw)

![](/files/-LvMRpRX01r3hwl3Z8uW)

### Checking with Valgrind

Here we run Valgrind (Callgrind) on the host machine. It shows the same hotspot as the version running on the Juno board: **resnet500\_forward\_conv\_gp**.

![](/files/-LvMRpRZaoBmyMm86pJ7)

In the source code we can see that the matrix multiplication part is indeed taking most of the execution time.

![](/files/-LvMRpRaGH2wtSIrcGV5)

### First matrix multiplication

In our network, the first matrix multiplication is A\[64x147] times B\[147x3600]. Below is the profiler output for a naive implementation running on the Juno CPU.

```bash
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 99.16      0.96     0.96        1   961.87   961.87  matrix_2d_mul_float(float*, float*, float*, int, int, int)
  1.03      0.97     0.01                             main
  0.00      0.97     0.00        2     0.00     0.00  fillRand(float*, int, int, int)
  0.00      0.97     0.00        1     0.00     0.00  _GLOBAL__sub_I_num_rows_A
  0.00      0.97     0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)
```

This means that we spend 961.87 milliseconds on the CPU. If we measure the time with our naive OpenCL implementation, it goes down to 41.4 milliseconds, about 23x faster.
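For reference, a naive CPU multiplication like the one profiled above might look like the following sketch. The function name is taken from the profile output; the meaning and order of the three `int` arguments (rows of A, columns of A, columns of B) and the row-major layout are assumptions, not necessarily what the original code does.

```c
/* Naive row-major matrix multiplication (a sketch, not the original code):
 *   C[rows_a x cols_b] = A[rows_a x cols_a] * B[cols_a x cols_b]
 * ASSUMPTION: argument order and row-major storage are illustrative only. */
void matrix_2d_mul_float(const float *A, const float *B, float *C,
                         int rows_a, int cols_a, int cols_b)
{
    for (int i = 0; i < rows_a; ++i) {
        for (int j = 0; j < cols_b; ++j) {
            float acc = 0.0f;
            /* Dot product of row i of A with column j of B. */
            for (int k = 0; k < cols_a; ++k) {
                acc += A[i * cols_a + k] * B[k * cols_b + j];
            }
            C[i * cols_b + j] = acc;
        }
    }
}
```

For the first layer this would be called with `rows_a = 64`, `cols_a = 147` and `cols_b = 3600`, which means 64 x 147 x 3600 = 33,868,800 multiply-accumulate operations in the innermost loop, all executed sequentially on one core.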

### Other matrix multiplication

From the profiler output we saw that the worst-case convolution was **resnet500\_forward\_conv\_gp**. It involves an im2col operation and a matrix multiplication of A\[256x2304] with B\[2304x64].
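As a rough illustration of the im2col step, the sketch below unrolls every convolution window of a channel-major (CHW) input into one column of a matrix, so that the convolution becomes a single matrix multiplication with the filter matrix. The layout, the function names `conv_out_size` and `im2col`, and their parameters are assumptions made for illustration; the generated resnet500 code may be organized differently.

```c
/* Output size of a convolution along one spatial dimension. */
int conv_out_size(int in, int kernel, int stride, int pad)
{
    return (in + 2 * pad - kernel) / stride + 1;
}

/* im2col sketch for a CHW input (ASSUMED layout).
 * cols is row-major with (channels * kh * kw) rows and (out_h * out_w)
 * columns. Reads that fall into the padding area produce zeros. */
void im2col(const float *in, int channels, int height, int width,
            int kh, int kw, int stride, int pad, float *cols)
{
    const int out_h = conv_out_size(height, kh, stride, pad);
    const int out_w = conv_out_size(width, kw, stride, pad);
    const int n_cols = out_h * out_w;

    for (int c = 0; c < channels; ++c) {
        for (int ky = 0; ky < kh; ++ky) {
            for (int kx = 0; kx < kw; ++kx) {
                /* One row per (channel, kernel row, kernel column). */
                const int row = (c * kh + ky) * kw + kx;
                for (int oy = 0; oy < out_h; ++oy) {
                    for (int ox = 0; ox < out_w; ++ox) {
                        const int iy = oy * stride - pad + ky;
                        const int ix = ox * stride - pad + kx;
                        float v = 0.0f;
                        if (iy >= 0 && iy < height && ix >= 0 && ix < width) {
                            v = in[(c * height + iy) * width + ix];
                        }
                        cols[row * n_cols + oy * out_w + ox] = v;
                    }
                }
            }
        }
    }
}
```

With this layout, a 3x3 kernel over 256 channels gives 3 x 3 x 256 = 2304 rows, which matches the inner dimension of A\[256x2304], and the 64 columns of B would correspond to an 8x8 output map. Likewise, 147 = 7 x 7 x 3 and 3600 = 60 x 60 are consistent with a 7x7, stride-2 first layer on the 120x120x3 input, although the actual layer parameters are not stated here.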

The matrix multiplication alone takes 46 ms with OpenCL, which is already 12x faster than pure CPU.

```bash
Multiplying 2 matrices A[256,2304] * B[2304,64]
Size in bytes A: 2359296
Size in bytes B: 589824
Size in bytes C: 65536
Initializing OpenCL device...
Compiling OpenCL kernel...
Global size[256, 64]
Matrix multiplication done 0
Matrix multiplication done 1
Matrix multiplication done 2
Matrix multiplication done 3
Matrix multiplication done 4
Matrix multiplication done 5
Matrix multiplication done 6
Matrix multiplication done 7
Matrix multiplication done 8
Matrix multiplication done 9

Kernel Execution time = 45.749 ms
```
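The `Global size[256, 64]` line in the log suggests one work-item per element of the 256x64 result. The sketch below emulates that mapping in plain C so it can run without an OpenCL device: `matmul_work_item` plays the role of the kernel body, and the two loops stand in for the 2D NDRange. The function names, the row-major layout and the assignment of dimension 0 to rows and dimension 1 to columns are assumptions, not the actual kernel used here.

```c
/* Body of one work-item: computes a single element C[row][col].
 * In an OpenCL kernel, row and col would typically come from
 * get_global_id(0) and get_global_id(1) (ASSUMED mapping). */
static void matmul_work_item(int row, int col,
                             const float *A, const float *B, float *C,
                             int K, int N)
{
    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
        acc += A[row * K + k] * B[k * N + col];
    }
    C[row * N + col] = acc;
}

/* Host-side emulation of a 2D NDRange with global size [M, N].
 * On the GPU the work-items run in parallel; here they run one by one. */
void matmul_ndrange_emulated(const float *A, const float *B, float *C,
                             int M, int K, int N)
{
    for (int gid0 = 0; gid0 < M; ++gid0) {
        for (int gid1 = 0; gid1 < N; ++gid1) {
            matmul_work_item(gid0, gid1, A, B, C, K, N);
        }
    }
}
```

For the multiplication above this would be called with `M = 256`, `K = 2304` and `N = 64`, so each of the 16,384 work-items performs a 2304-element dot product. A naive kernel like this reads every element of A and B from global memory many times, which is one place to look for further gains.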

## What to do now

We now know that our hotspot is on the convolution side, so we're going to create a simpler model containing only the worst-case convolution (**resnet500\_forward\_conv\_gp**). From there we will look at where to improve.

![](/files/-LvMRpRcaqIzVOi5KLFM)

