Fully Connected Layer

Introduction

This chapter explains how to implement the fully connected layer in MATLAB and Python, including the forward and back-propagation.

First consider the fully connected layer as a black box with the following properties:

On the forward propagation:
1. Has 3 inputs (input signal, weights, bias)
2. Has 1 output

On the back propagation:
1. Has 1 input (dout), which has the same size as the output
2. Has 3 outputs (dx, dw, db), which have the same sizes as the inputs

Neural network point of view

Just by looking at the diagram we can infer the outputs:

$$y_1=[(w_{11}.x_1)+(w_{12}.x_2)+(w_{13}.x_3)] + b_1\\ y_2=[(w_{21}.x_1)+(w_{22}.x_2)+(w_{23}.x_3)] + b_2$$

Now vectorizing (putting it in matrix form), observe the two possible versions:

$$\underbrace{\begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{bmatrix}}_{\text{One column per x dimension}}.\begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \end{bmatrix}+\begin{bmatrix} b_{1} \\ b_{2} \end{bmatrix}=\begin{bmatrix} y_{1} \\ y_{2} \end{bmatrix} \therefore \\ H(X) = (W.x)+b \\ H(X) = (W^T.x)+b$$

Depending on the format that you choose to represent W, pay attention to this because it can be confusing.

For example, if we choose X to be a row vector, our matrix multiplication must be:

$$(\begin{bmatrix} x_{1} & x_{2} & x_{3} \end{bmatrix}.\begin{bmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \\ w_{13} & w_{23} \end{bmatrix})+\begin{bmatrix} b_{1} & b_{2}\end{bmatrix}=\begin{bmatrix} y_{1} & y_{2}\end{bmatrix} \therefore \\ H(X)=(x.W^T)+b$$
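
As a quick sanity check of this row-vector form, here is a minimal numpy sketch (the values are illustrative, not part of the chapter's final implementation):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])           # input signal, one sample with 3 features
W = np.array([[0.1, 0.2, 0.3],          # weights, one row per output neuron: [2x3]
              [0.4, 0.5, 0.6]])
b = np.array([0.5, -0.5])               # bias, one value per output neuron

y = x.dot(W.T) + b                      # H(X) = (x.W^T) + b, gives a [1x2] row
print(y)                                # [1.9 2.7]
```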

Computation graph point of view

In order to discover how each input influences the output (backpropagation), it is better to represent the algorithm as a computation graph.

Now, for the backpropagation, let's focus on one of the graphs and apply what we have learned so far about backpropagation.

Summarizing the calculation for the first output ($y_1$), consider a global loss $L$ and $dout_{y1}=\frac{\partial L}{\partial y_1}$:

$$\Large \frac{\partial L}{\partial x_1}=dout_{y1}.w_{11}\\ \Large \frac{\partial L}{\partial x_2}=dout_{y1}.w_{12}\\ \Large \frac{\partial L}{\partial x_3}=dout_{y1}.w_{13}$$

$$\Large \frac{\partial L}{\partial w_{11}}=dout_{y1}.x_1\\ \Large \frac{\partial L}{\partial w_{12}}=dout_{y1}.x_2\\ \Large \frac{\partial L}{\partial w_{13}}=dout_{y1}.x_3$$

$$\Large \frac{\partial L}{\partial b_1}=dout_{y1}$$

Also, extending to the second output ($y_2$):

$$\Large \frac{\partial L}{\partial x_1}=dout_{y2}.w_{21}\\ \Large \frac{\partial L}{\partial x_2}=dout_{y2}.w_{22}\\ \Large \frac{\partial L}{\partial x_3}=dout_{y2}.w_{23}$$

$$\Large \frac{\partial L}{\partial w_{21}}=dout_{y2}.x_1\\ \Large \frac{\partial L}{\partial w_{22}}=dout_{y2}.x_2\\ \Large \frac{\partial L}{\partial w_{23}}=dout_{y2}.x_3$$

$$\Large \frac{\partial L}{\partial b_2}=dout_{y2}$$

Merging the results, for dx:

$$\frac{\partial L}{\partial x_1}=[dout_{y1}.w_{11}+dout_{y2}.w_{21}]\\ \frac{\partial L}{\partial x_2}=[dout_{y1}.w_{12}+dout_{y2}.w_{22}]\\ \frac{\partial L}{\partial x_3}=[dout_{y1}.w_{13}+dout_{y2}.w_{23}]$$

In matrix form:

$$\frac{\partial L}{\partial X}=\begin{bmatrix} dout_{y1} & dout_{y2} \end{bmatrix}.\begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{bmatrix}$$ or $$\frac{\partial L}{\partial X}=\begin{bmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \\ w_{13} & w_{23} \end{bmatrix}.\begin{bmatrix} dout_{y1} \\ dout_{y2} \end{bmatrix}$$

Depending on the format that you choose to represent X (as a row or column vector), pay attention to this because it can be confusing.

Now for dW. It's important to note that every gradient has the same dimension as its original value; for instance, dW has the same dimension as W. In other words:

$$W=\begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{bmatrix} \therefore \frac{\partial L}{\partial W}=\begin{bmatrix} \frac{\partial L}{\partial w_{11}} & \frac{\partial L}{\partial w_{12}} & \frac{\partial L}{\partial w_{13}} \\ \frac{\partial L}{\partial w_{21}} & \frac{\partial L}{\partial w_{22}} & \frac{\partial L}{\partial w_{23}} \end{bmatrix}$$

$$\frac{\partial L}{\partial W}=\begin{bmatrix} dout_{y1} \\ dout_{y2} \end{bmatrix}.\begin{bmatrix} x_{1} & x_{2} & x_{3} \end{bmatrix}=\begin{bmatrix} \frac{\partial L}{\partial w_{11}} & \frac{\partial L}{\partial w_{12}} & \frac{\partial L}{\partial w_{13}} \\ \frac{\partial L}{\partial w_{21}} & \frac{\partial L}{\partial w_{22}} & \frac{\partial L}{\partial w_{23}} \end{bmatrix}$$

And for db: $$\Large \frac{\partial L}{\partial b}=\begin{bmatrix} dout_{y1} & dout_{y2} \end{bmatrix}$$
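
Putting dx, dW and db together in numpy (again a minimal illustrative sketch continuing the tiny example above, not the chapter's final code):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])           # same x and W as in the forward example
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])
dout = np.array([1.0, 2.0])             # dL/dy coming from the layer above

dx = dout.dot(W)                        # dL/dX = dout.W      -> shape (3,)
dW = np.outer(dout, x)                  # dL/dW = dout^T.x    -> shape (2, 3)
db = dout                               # dL/db = dout        -> shape (2,)
print(dx, dW, db, sep='\n')
```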

Expanding for bigger batches

All the examples so far deal with a single element on the input, but normally we deal with much more than one example at a time. For instance, on GPUs it is common to have batches of 256 images at the same time. The trick is to represent the input signal as a 2d matrix [NxD], where N is the batch size and D is the dimension of the input signal. So if you consider the MNIST dataset, where each digit is a 28x28x1 (grayscale) image, D will be 784, and if we have 10 digits on the same batch our input will be [10x784].

For the sake of argument, let's consider our previous samples where the vector X was represented as $X=\begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix}$. If we want to have a batch of 4 elements we will have:

$$X_{batch}=\begin{bmatrix} x_{1 sample 1} & x_{2 sample 1} & x_{3 sample 1} \\ x_{1 sample 2} & x_{2 sample 2} & x_{3 sample 2} \\ x_{1 sample 3} & x_{2 sample 3} & x_{3 sample 3} \\ x_{1 sample 4} & x_{2 sample 4} & x_{3 sample 4} \end{bmatrix} \therefore X_{batch}=[4,3]$$

In this case W must be represented in a way that supports this matrix multiplication, so depending on how it was created it may need to be transposed:

$$W^T=\begin{bmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \\ w_{13} & w_{23} \end{bmatrix}$$

Continuing, the forward propagation will be computed as:

$$(\begin{bmatrix} x_{1 sample 1} & x_{2 sample 1} & x_{3 sample 1} \\ x_{1 sample 2} & x_{2 sample 2} & x_{3 sample 2} \\ x_{1 sample 3} & x_{2 sample 3} & x_{3 sample 3} \\ x_{1 sample 4} & x_{2 sample 4} & x_{3 sample 4} \end{bmatrix}.\begin{bmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \\ w_{13} & w_{23} \end{bmatrix})+\begin{bmatrix} b_{1 sample 1} & b_{2 sample 1} \\ b_{1 sample 2} & b_{2 sample 2} \\ b_{1 sample 3} & b_{2 sample 3} \\ b_{1 sample 4} & b_{2 sample 4} \end{bmatrix}=\begin{bmatrix} y_{1 sample 1} & y_{2 sample 1} \\ y_{1 sample 2} & y_{2 sample 2} \\ y_{1 sample 3} & y_{2 sample 3} \\ y_{1 sample 4} & y_{2 sample 4}\end{bmatrix}$$

One point to observe here is that the bias has been repeated 4 times to accommodate the product X.W, which in this case will generate a [4x2] matrix. On MATLAB the command "repmat" does the job; on Python, numpy broadcasting does it automatically.
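
For instance, a small numpy sketch of this batched forward pass (illustrative values; broadcasting adds b to every row without an explicit repmat):

```python
import numpy as np

X = np.random.rand(4, 3)                # batch of 4 samples, 3 features each
W = np.random.rand(2, 3)                # 2 output neurons, 3 inputs each
b = np.array([0.5, -0.5])               # one bias per output neuron

Y = X.dot(W.T) + b                      # (4,3).(3,2) -> (4,2); b broadcasts over the rows
print(Y.shape)                          # (4, 2)
```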

Using Symbolic engine

Before jumping to the implementation, it is good to verify the operations on the MATLAB or Python (sympy) symbolic engine. This will help to visualize and explore the results before actually coding the functions.

Symbolic forward propagation on Matlab

Here, after we define the symbolic variables, we create the matrices W, X, b, then calculate $y=(W.X)+b$ and compare the final result with what we calculated before.
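
The same check can be written with Python's sympy (a minimal sketch with illustrative variable names, following the column-vector convention $y=(W.X)+b$):

```python
import sympy as sp

# Symbolic variables mirroring the notation used in the text
x1, x2, x3 = sp.symbols('x1 x2 x3')
w11, w12, w13, w21, w22, w23 = sp.symbols('w11 w12 w13 w21 w22 w23')
b1, b2 = sp.symbols('b1 b2')

X = sp.Matrix([x1, x2, x3])             # column vector [3x1]
W = sp.Matrix([[w11, w12, w13],
               [w21, w22, w23]])        # [2x3]
b = sp.Matrix([b1, b2])                 # [2x1]

y = W * X + b                           # forward pass y = (W.X) + b
sp.pprint(y)
```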

Symbolic backward propagation on Matlab

Now we also confirm the backward propagation formulas. Observe the function "latex", which converts an expression to LaTeX on MATLAB:

$$\left(\begin{array}{ccc} \mathrm{douty1}\, \mathrm{x1} & \mathrm{douty1}\, \mathrm{x2} & \mathrm{douty1}\, \mathrm{x3}\\ \mathrm{douty2}\, \mathrm{x1} & \mathrm{douty2}\, \mathrm{x2} & \mathrm{douty2}\, \mathrm{x3} \end{array}\right)$$
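
The same verification can be sketched with sympy (under the column-vector convention used above; dW should match the matrix printed above and dX the second form of $\frac{\partial L}{\partial X}$):

```python
import sympy as sp

x1, x2, x3, douty1, douty2 = sp.symbols('x1 x2 x3 douty1 douty2')
w11, w12, w13, w21, w22, w23 = sp.symbols('w11 w12 w13 w21 w22 w23')

X = sp.Matrix([x1, x2, x3])             # [3x1]
W = sp.Matrix([[w11, w12, w13],
               [w21, w22, w23]])        # [2x3]
dout = sp.Matrix([douty1, douty2])      # [2x1]

dW = dout * X.T                         # [2x3], one entry per weight
dX = W.T * dout                         # [3x1]
print(sp.latex(dW))                     # analogous to MATLAB's latex() call
```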

Input Tensor

Our library will be handling images, and most of the time we will be handling matrix operations on hundreds of images at the same time, so we must find a way to represent them. Here we will represent a batch of images as a 4d tensor, or an array of 3d matrices. Below we have a batch of 4 RGB images (width: 160, height: 120). We're going to load them on MATLAB/Python and organize them on a 4d matrix.

On Python, before we store the image on the tensor, we do a transpose to convert our 120x160x3 image to 3x120x160, and then store it on a 4x3x120x160 tensor.
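
A minimal numpy sketch of that layout (the file loading step is omitted; a random array stands in for a decoded image):

```python
import numpy as np

# Batch of 4 RGB images, height 120, width 160, stored as (N, C, H, W)
batch = np.zeros((4, 3, 120, 160), dtype=np.float32)

img = np.random.rand(120, 160, 3)       # one image as loaded (H x W x C)
batch[0] = img.transpose(2, 0, 1)       # convert to C x H x W before storing
print(batch.shape)                      # (4, 3, 120, 160)
```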

Python Implementation

Forward Propagation
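
A minimal sketch of a forward pass in numpy, assuming the input tensor is flattened to [N x D] and W is stored as [M x D] to match $H(X)=X.W^T+b$ above (the actual layer code may use a different storage convention):

```python
import numpy as np

def fc_forward(x, w, b):
    """Fully connected forward pass (sketch).

    x: input of shape (N, d1, ..., dk), flattened per sample to D features
    w: weights of shape (M, D), one row per output neuron
    b: bias of shape (M,)
    Returns the output of shape (N, M) and a cache for the backward pass.
    """
    N = x.shape[0]
    x_flat = x.reshape(N, -1)           # [N x D]
    out = x_flat.dot(w.T) + b           # H(X) = X.W^T + b, b broadcasts over rows
    cache = (x, w, b)
    return out, cache
```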

Backward Propagation
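
And a matching backward pass sketch, following $dX = dout.W$, $dW = dout^T.X$ and db as the sum of dout over the batch (again illustrative, not necessarily the layer's final code):

```python
def fc_backward(dout, cache):
    """Fully connected backward pass (sketch).

    dout: upstream gradient of shape (N, M)
    Returns dx, dw, db with the same shapes as x, w and b.
    """
    x, w, b = cache
    N = x.shape[0]
    x_flat = x.reshape(N, -1)           # [N x D]
    dx = dout.dot(w).reshape(x.shape)   # dL/dX = dout.W, reshaped back to x's shape
    dw = dout.T.dot(x_flat)             # dL/dW = dout^T.X
    db = dout.sum(axis=0)               # accumulate db over the batch
    return dx, dw, db
```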

Matlab Implementation

One special point to pay attention to is the way that MATLAB represents high-dimensional arrays, in contrast with numpy. Another point that may cause confusion is the fact that MATLAB stores data in column-major order while numpy uses row-major order.

Multidimensional arrays in python and matlab

One difference in how MATLAB and Python represent multidimensional arrays must be noticed. Say we want to create a 4-channel 2x3 matrix: in MATLAB you need to create an array of size (2,3,4), while on Python it needs to be (4,2,3).
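
For example, on the numpy side (a small illustrative snippet):

```python
import numpy as np

# 4 channels of a 2x3 matrix: the channel dimension comes first in numpy
A = np.zeros((4, 2, 3))
print(A.shape)            # (4, 2, 3)
# The MATLAB equivalent would be zeros(2,3,4): the 2x3 planes come first
# and the 4 channels are stacked along the last dimension.
```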

Matlab Reshape order

As mentioned before, MATLAB runs the reshape command one column at a time, so if you want to change this behavior you need to transpose the input matrix first.

If you are dealing with more than 2 dimensions, you need to use the "permute" command to transpose. On Python the default of the reshape command is one row at a time, or, if you want, you can also change the order (this option does not exist in MATLAB).
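
For instance, numpy's reshape accepts an order argument (a quick illustrative comparison):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
print(A.reshape(3, 2))              # row-major (C order), numpy's default
# [[1 2]
#  [3 4]
#  [5 6]]
print(A.reshape(3, 2, order='F'))   # column-major (Fortran order), like MATLAB's reshape
# [[1 5]
#  [4 3]
#  [2 6]]
```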

Below we have a reshape in row-major order as a new function:

The other option to avoid this permutation/reshape is to have the weight matrix in a different order and calculate the forward propagation like this:

$$\begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{bmatrix}.\begin{bmatrix} x_{1 sample 1} & x_{1 sample 2} \\ x_{2 sample 1} & x_{2 sample 2} \\ x_{3 sample 1} & x_{3 sample 2} \end{bmatrix}+\begin{bmatrix} b_{1} & b_{1} \\ b_{2} & b_{2} \end{bmatrix}=\begin{bmatrix} y_{1 sample 1} & y_{1 sample 2} \\ y_{2 sample 1} & y_{2 sample 2} \end{bmatrix}$$

With x as a column vector and the weights organized row-wise; in the example presented we keep using the same order as the Python example.
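
In numpy terms this alternative layout would correspond to something like the following (a small illustrative sketch, with each sample stored as a column):

```python
import numpy as np

W = np.random.rand(2, 3)                # weights, one row per output neuron
X = np.random.rand(3, 2)                # 2 samples stored as columns of 3 features
b = np.array([[0.5], [-0.5]])           # column bias, broadcast over the samples

Y = W.dot(X) + b                        # (2,3).(3,2) -> (2,2), one output column per sample
print(Y.shape)                          # (2, 2)
```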

Forward Propagation

Backward Propagation

Next Chapter

In the next chapter we will learn about ReLU layers.
