This chapter explains how to implement the convolution layer in Python and MATLAB.

In simple terms, the convolution layer applies the convolution operator to every image in the input tensor and transforms the input depth to match the number of filters. Below we explain its parameters and signals:

N: Batch size (number of images in the 4D tensor)

F: Number of filters in the convolution layer

kW/kH: Kernel width/height (normally we use square kernels, so kW=kH)

H/W: Image height/width (Normally H=W)

H'/W': Convolved image height/width (remains the same as the input if proper padding is used with stride 1)

Stride: Number of pixels that the convolution sliding window moves at each step.

Padding: Zeros added to the border of the image, typically to keep the input and output sizes the same.

Depth: Volume input depth (i.e., if the input is an RGB image, the depth will be 3)

Output depth: Volume output depth (same as F)
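The relation between these parameters can be sketched with the usual output-size rule, $H'=(H-kH+2P)/Stride+1$ (and the same for the width). The helper below is only an illustration: the name `conv_output_size` is hypothetical, and it assumes the same padding on both sides and rejects configurations where the window does not tile the input evenly (some libraries round down instead).

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output size of a convolution along one axis (height or width)."""
    span = input_size - kernel_size + 2 * padding
    if span < 0 or span % stride != 0:
        # Assumption of this sketch: refuse shapes that do not fit exactly
        raise ValueError("kernel, stride and padding do not tile the input evenly")
    return span // stride + 1


print(conv_output_size(32, 3, stride=1, padding=1))  # 32 (size preserved)
print(conv_output_size(32, 5, stride=1, padding=0))  # 28
print(conv_output_size(7, 3, stride=2, padding=0))   # 3
```

With a 3x3 kernel and stride 1, a padding of 1 keeps H'=H, which is the "proper padding" case mentioned in the list above.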

During forward propagation, remember that we "convolve" each input depth with its own slice of a filter, and that each filter looks for something different in the image.

Here, observe that all neurons (flashlights) in layer 1 share the same set of weights; other filters will look for different patterns in the image.

Basically, we can take the "convn_vanilla" function from the Convolution chapter and apply it for each depth of the input and output.
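As a rough illustration of that idea, here is a minimal NumPy sketch of a naive forward pass. It does not reproduce "convn_vanilla"; the function name `conv_forward_naive` and its signature are assumptions made for this example. It also assumes the Python tensor order [N x Depth x H x W] and slides the kernel without flipping it (cross-correlation, as most deep-learning libraries do), so flip the kernel first if you want the flipped convention.

```python
import numpy as np


def conv_forward_naive(x, w, b, stride=1, pad=0):
    """x: (N, Depth, H, W), w: (F, Depth, kH, kW), b: (F,) -> (N, F, H', W')."""
    N, _, H, W = x.shape
    F, _, kH, kW = w.shape
    H_out = (H - kH + 2 * pad) // stride + 1
    W_out = (W - kW + 2 * pad) // stride + 1
    # Zero-pad only the two spatial axes
    x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="constant")
    out = np.zeros((N, F, H_out, W_out))
    for n in range(N):                      # every image in the batch
        for f in range(F):                  # every filter (output depth)
            for i in range(H_out):
                for j in range(W_out):
                    top, left = i * stride, j * stride
                    window = x_pad[n, :, top:top + kH, left:left + kW]
                    # Sum over input depth and kernel area, then add the bias
                    out[n, f, i, j] = np.sum(window * w[f]) + b[f]
    return out


x = np.ones((2, 3, 5, 5))      # 2 RGB-like images of 5x5
w = np.ones((4, 3, 3, 3))      # 4 filters of 3x3 spanning the full depth
b = np.zeros(4)
out = conv_forward_naive(x, w, b, stride=1, pad=1)
print(out.shape)  # (2, 4, 5, 5)
```

The four nested loops are slow but make the roles of N, F, stride, and padding explicit; the output depth equals the number of filters.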

The only point to observe here is that, because of the way multidimensional arrays are represented in Python, our tensors will have a different dimension order.
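A small sketch of what the different order means in practice, assuming the Python layout [N x Depth x H x W] and the MATLAB layout [H x W x Depth x N] used in this chapter. The variable names are illustrative only.

```python
import numpy as np

# Python-style tensor: (N, Depth, H, W)
x_python = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)

# Reorder the axes to the MATLAB-style layout: (H, W, Depth, N)
x_matlab = np.transpose(x_python, (2, 3, 1, 0))
print(x_matlab.shape)  # (4, 5, 3, 2)

# The same element is addressed with the indices in a different order
n, d, h, w = 1, 0, 1, 2
assert x_matlab[h, w, d, n] == x_python[n, d, h, w]

# And back again to (N, Depth, H, W)
x_back = np.transpose(x_matlab, (3, 2, 0, 1))
assert np.array_equal(x_back, x_python)
```

Only the axis order is addressed here; the values themselves are unchanged.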

To derive the back-propagation of the convolution layer it is easier to think in terms of the 1D convolution; the results carry over to 2D.

So, for a 1D convolution between a signal $X=[x_0,x_1,x_2,x_3,x_4]$ and a kernel $W=[w_0,w_1,w_2]$, without padding, we get $Y=[y_0,y_1,y_2]$, where $Y = X * flip(W)$. Here flip can be considered a 180-degree rotation.
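The 1D case is small enough to check by hand. The sketch below uses made-up numbers (they are not from the original text) and slides the flipped kernel over X without padding, so that $y_0 = w_2 x_0 + w_1 x_1 + w_0 x_2$ and so on; it then compares the result with NumPy's `np.convolve` in "valid" mode, which performs the same flipped convolution.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # x0..x4 (illustrative values)
W = np.array([1.0, 0.0, -1.0])            # w0..w2 (illustrative values)

# Slide flip(W) = [w2, w1, w0] over X, one position per output element
W_flipped = W[::-1]
Y_manual = np.array([np.dot(X[i:i + 3], W_flipped) for i in range(3)])

# np.convolve flips the kernel internally; "valid" means no padding
Y_numpy = np.convolve(X, W, mode="valid")

print(Y_manual)  # [2. 2. 2.]
assert np.allclose(Y_manual, Y_numpy)
```

An input of 5 elements and a kernel of 3 elements give 5 - 3 + 1 = 3 outputs, matching $Y=[y_0,y_1,y_2]$.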

Now we convert all the "valid cases" to a computation graph. Observe that we are adding the bias, because it is used in the convolution layer.

Observe that the graphs are basically the same as for the fully connected layer; the only difference is that the weights are shared.

Now moving on to back-propagation:

If you follow the computation graphs backward, as presented in the Backpropagation chapter, we get the following formulas for $\frac{\partial L}{\partial X}$, which describes how the loss changes with the input X:

$\frac{\partial L}{\partial x_0}=(w_2 \cdot dout_{y0})\\ \frac{\partial L}{\partial x_1}=(w_1 \cdot dout_{y0})+(w_2 \cdot dout_{y1})\\ \frac{\partial L}{\partial x_2}=(w_0 \cdot dout_{y0})+(w_1 \cdot dout_{y1}) + (w_2 \cdot dout_{y2})\\ \frac{\partial L}{\partial x_3}=(w_0 \cdot dout_{y1})+(w_1 \cdot dout_{y2}) \\ \frac{\partial L}{\partial x_4}=(w_0 \cdot dout_{y2})$

Now consider a few things:

1. dX must have the same size as X, so we need padding.
2. dout must have the same size as Y, which in this case is 3 (the incoming gradient).
3. To save programming effort, we want to calculate the gradient as a convolution.
4. In the dX gradient every term is multiplied by an element of W, so we are probably convolving W and dout.

Following the output-size rule for the 1D convolution (with stride 1): $outputSize=(InputSize-KernelSize+2P)+1$. Our desired output size is 5 (the size of X), the input to this convolution is dout with 3 elements, and we convolve it with W, which also has 3 elements. Solving $5=(3-3+2P)+1$ gives $P=2$, so we need to pad dout with 2 zeros on each side.

The convolution above implements all the calculations needed for $\frac{\partial L}{\partial X}$, so in terms of convolution: $\Large\frac{\partial L}{\partial X}=\underbrace{dout}_\text{zero padded} * \overbrace{W}^\text{flipped K or W}$
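The input-gradient result can be verified numerically. The sketch below uses made-up values for W and dout, evaluates the five formulas for $\frac{\partial L}{\partial x_i}$ term by term, and compares them with two equivalent ways of computing the padded convolution. Since the forward pass slid flip(W) over X, the backward pass slides W itself (the forward kernel rotated by 180 degrees) over the zero-padded dout.

```python
import numpy as np

W = np.array([1.0, 0.0, -1.0])        # w0..w2 (illustrative values)
dout = np.array([0.5, -1.0, 2.0])     # gradient arriving at y0..y2
w0, w1, w2 = W
d0, d1, d2 = dout

# The formulas read off the computation graph
dX_formulas = np.array([
    w2 * d0,
    w1 * d0 + w2 * d1,
    w0 * d0 + w1 * d1 + w2 * d2,
    w0 * d1 + w1 * d2,
    w0 * d2,
])

# Padding from the size rule: P = (out - in + kernel - 1) / 2 = (5 - 3 + 3 - 1) / 2
P = (5 - 3 + 3 - 1) // 2
dout_padded = np.pad(dout, (P, P), mode="constant")   # [0, 0, d0, d1, d2, 0, 0]

# Slide W over the padded dout: 7 - 3 + 1 = 5 outputs, the size of X
dX_sliding = np.array([np.dot(dout_padded[i:i + 3], W) for i in range(5)])

# Same thing with NumPy: "full" mode pads implicitly, and np.convolve flips
# its kernel, so we pass flip(W) to end up sliding W itself
dX_full = np.convolve(dout, W[::-1], mode="full")

assert np.allclose(dX_formulas, dX_sliding)
assert np.allclose(dX_formulas, dX_full)
print(dX_full.shape)  # (5,)
```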

Now let's continue with $\frac{\partial L}{\partial W}=[\partial w_0, \partial w_1, \partial w_2]$, which must have the same size as W:

$\frac{\partial L}{\partial w_0}=(x_2 \cdot dout_{y0})+(x_3 \cdot dout_{y1}) + (x_4 \cdot dout_{y2})\\ \frac{\partial L}{\partial w_1}=(x_1 \cdot dout_{y0})+(x_2 \cdot dout_{y1}) + (x_3 \cdot dout_{y2})\\ \frac{\partial L}{\partial w_2}=(x_0 \cdot dout_{y0})+(x_1 \cdot dout_{y1}) + (x_2 \cdot dout_{y2})$

Again, just by looking at the expressions taken from the graph, we can see that they can be represented as a convolution between dout and X. Also, since the output has 3 elements (5 - 3 + 1), no padding is needed.

So, in terms of convolution, the calculation for $\frac{\partial L}{\partial W}$ is: $\Large\frac{\partial L}{\partial W}=\underbrace{\hat{X}}_\text{flipped X} * dout$

One point to remember: if you treat X as the kernel and dout as the signal, X is flipped automatically by the convolution: $\Large\frac{\partial L}{\partial W}=dout * X$
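The weight gradient can be checked the same way. In this sketch (made-up values again) the three formulas for $\frac{\partial L}{\partial w_k}$ are compared with a "valid" convolution between the flipped X and dout, and with a finite-difference estimate. The scalar `loss` function is a stand-in chosen only so that its gradient with respect to Y equals dout; it is not part of the original text.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # x0..x4 (illustrative values)
W = np.array([1.0, 0.0, -1.0])            # w0..w2
dout = np.array([0.5, -1.0, 2.0])         # gradient arriving at y0..y2
x0, x1, x2, x3, x4 = X
d0, d1, d2 = dout

# The formulas read off the computation graph
dW_formulas = np.array([
    x2 * d0 + x3 * d1 + x4 * d2,
    x1 * d0 + x2 * d1 + x3 * d2,
    x0 * d0 + x1 * d1 + x2 * d2,
])

# Flipped X convolved with dout, no padding: 5 - 3 + 1 = 3 outputs
dW_conv = np.convolve(X[::-1], dout, mode="valid")


def loss(w):
    # Stand-in loss L = sum(Y * dout), so that dL/dY is exactly dout
    return np.dot(np.convolve(X, w, mode="valid"), dout)


# Central finite differences, one weight at a time
eps = 1e-6
basis = np.eye(3)
dW_numeric = np.array([
    (loss(W + eps * basis[k]) - loss(W - eps * basis[k])) / (2 * eps)
    for k in range(3)
])

assert np.allclose(dW_formulas, dW_conv)
assert np.allclose(dW_formulas, dW_numeric)
print(dW_conv)  # [7.5 6.  4.5]
```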

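Now for the bias: the calculation is similar to the fully connected layer. Basically, we have one bias per filter (depth), so each bias gradient is the sum of dout over the batch (and, in the convolution layer, over every output position to which that bias was added): $\Large \frac{\partial L}{\partial b}=\begin{bmatrix} \sum_{batch}(dout_{y0}) & \sum_{batch}(dout_{y1}) & \sum_{batch}(dout_{y2}) \end{bmatrix}$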
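For a 4D output gradient this reduces to a sum over every axis except the filter axis. The sketch below assumes dout is laid out as [N x F x H' x W'] and uses random numbers purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dout = rng.standard_normal((4, 3, 5, 5))   # (N, F, H', W'), illustrative shape

# One gradient per filter: sum over batch and both spatial axes
db = dout.sum(axis=(0, 2, 3))

# The same thing written as an explicit loop over filters
db_loop = np.zeros(3)
for f in range(3):
    db_loop[f] = dout[:, f, :, :].sum()

assert np.allclose(db, db_loop)
print(db.shape)  # (3,)
```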

Before jumping to the code, some points need to be reviewed:

If you use a parameter (e.g., stride or padding) during forward propagation, you must apply it during backward propagation as well.

In Python our multidimensional tensors are ordered as "input=[N x Depth x H x W]"; in MATLAB they are ordered as "input=[H x W x Depth x N]".

As mentioned before, the gradient of an input has the same size as the input itself: "size(x)==size(dx)".
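The points above can be put together in a naive backward pass. This is only a sketch under stated assumptions: the name `conv_backward_naive` and its signature are hypothetical, the layout is the Python one ([N x Depth x H x W]), and the matching forward pass is assumed to slide the kernel without flipping it. The same stride and padding used in the forward pass are applied again here, and the padding is cropped at the end so that dx has the size of x.

```python
import numpy as np


def conv_backward_naive(dout, x, w, stride=1, pad=0):
    """dout: (N, F, H', W'), x: (N, Depth, H, W), w: (F, Depth, kH, kW)."""
    N, _, H, W = x.shape
    F, _, kH, kW = w.shape
    _, _, H_out, W_out = dout.shape
    # Re-apply the forward padding so window positions match the forward pass
    x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="constant")
    dx_pad = np.zeros_like(x_pad)
    dw = np.zeros_like(w)
    db = dout.sum(axis=(0, 2, 3))            # one bias gradient per filter
    for n in range(N):
        for f in range(F):
            for i in range(H_out):
                for j in range(W_out):
                    top, left = i * stride, j * stride
                    window = x_pad[n, :, top:top + kH, left:left + kW]
                    # Each output element sends gradient to the weights it used
                    dw[f] += window * dout[n, f, i, j]
                    # ... and to the input window it looked at
                    dx_pad[n, :, top:top + kH, left:left + kW] += w[f] * dout[n, f, i, j]
    # Drop the gradient that fell on the zero padding
    dx = dx_pad[:, :, pad:pad + H, pad:pad + W]
    return dx, dw, db


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 5, 5))
w = rng.standard_normal((4, 3, 3, 3))
dout = rng.standard_normal((2, 4, 3, 3))    # H' = (5 - 3 + 2*1) / 2 + 1 = 3
dx, dw, db = conv_backward_naive(dout, x, w, stride=2, pad=1)
print(dx.shape == x.shape, dw.shape == w.shape, db.shape)  # True True (4,)
```

The loops mirror a naive forward pass one to one, which makes it easy to see why every forward parameter has to reappear in the backward pass.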

In the next chapter we will learn about the pooling layer.