Backpropagation

Introduction

Backpropagation is an algorithm that calculates the partial derivative of every node in your model (e.g., a convnet or neural network). Those partial derivatives are used during the training phase of your model, where a loss function states how far you are from the correct result. This error is propagated backward from the model output to its first layers. Backpropagation is more easily implemented if you structure your model as a computational graph.

The most important thing to have in mind here is how to calculate the forward propagation of each block and its gradient. In fact, most of the code in deep learning libraries is about implementing the forward/backward passes of these gates.

Basic blocks

Some examples of basic blocks are add, multiply, exp, and max. All we need to do is define their forward and backward calculations, as summarized below.
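For example, the forward expressions and local gradients of the add, multiply, and max gates are (here z is the output, and \mathbb{1} is the indicator function: the max gate routes the gradient to the larger input):

\LARGE z = x + y \hspace{1in} \rightarrow \hspace{1in} \frac{\partial z}{\partial x} = 1, \quad \frac{\partial z}{\partial y} = 1 \\\\ \LARGE z = x \cdot y \hspace{1in} \rightarrow \hspace{1in} \frac{\partial z}{\partial x} = y, \quad \frac{\partial z}{\partial y} = x \\\\ \LARGE z = \max(x, y) \hspace{1in} \rightarrow \hspace{1in} \frac{\partial z}{\partial x} = \mathbb{1}(x \geq y), \quad \frac{\partial z}{\partial y} = \mathbb{1}(y > x)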

Some other derivatives

\LARGE f(x) = \frac{1}{x} \hspace{1in} \rightarrow \hspace{1in} \frac{df}{dx} = -\frac{1}{x^2} \\\\ \LARGE f_c(x) = c + x \hspace{1in} \rightarrow \hspace{1in} \frac{df}{dx} = 1 \\\\ \LARGE f(x) = e^x \hspace{1in} \rightarrow \hspace{1in} \frac{df}{dx} = e^x \\\\ \LARGE f_a(x) = ax \hspace{1in} \rightarrow \hspace{1in} \frac{df}{dx} = a
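These derivatives are easy to sanity-check numerically with a centered finite difference, the same trick commonly used to verify hand-written backward passes. A minimal sketch (the function numerical_derivative is just for illustration):

```python
import math

def numerical_derivative(f, x, h=1e-5):
    # Centered finite difference: (f(x+h) - f(x-h)) / (2h).
    return (f(x + h) - f(x - h)) / (2.0 * h)

print(numerical_derivative(lambda x: 1.0 / x, 2.0))  # ~ -1/2**2 = -0.25
print(numerical_derivative(lambda x: 3.0 + x, 2.0))  # ~ 1 (c = 3)
print(numerical_derivative(math.exp, 2.0))           # ~ e**2 = 7.389
print(numerical_derivative(lambda x: 5.0 * x, 2.0))  # ~ 5 (a = 5)
```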

Observe that we output 2 gradients because we have 2 inputs. Also observe that we need to save (cache) the forward-pass inputs in memory, because they are needed to compute the gradients during the backward pass.

Chain Rule

Imagine that you have an output y that is a function of g, which is a function of f, which is a function of x. If you want to know how much y will change with a small change in x (dy/dx), we use the chain rule. The chain rule is a formula for computing the derivative of the composition of two or more functions.
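Written out for this composition, the chain rule reads:

\LARGE \frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{df} \cdot \frac{df}{dx}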

The chain rule is the workhorse of back-propagation, so it's important to understand it now. In the picture below we have a node f(x,y) that computes some function of its two inputs x, y and outputs z. On the right side, this same node receives from somewhere (the loss function) a gradient dL/dz, which means "how much L will change with a small change in z". As the node has 2 inputs, it will have 2 gradients: one showing how L will change with a small change in x (dL/dx) and the other showing how L will change with a small change in y (dL/dy).
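In symbols, each input gradient is just the node's local derivative multiplied by the gradient arriving from above:

\LARGE \frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial x} \hspace{1in} \frac{\partial L}{\partial y} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial y}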

Gates Implementation

Observe below the implementation of the multiply and add gates in Python.
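The listing below is a minimal sketch of what such gates can look like (the class names MultiplyGate and AddGate and the forward/backward signatures are illustrative, not from any particular library):

```python
class MultiplyGate:
    def forward(self, x, y):
        # Cache the inputs; the backward pass needs them.
        self.x, self.y = x, y
        return x * y

    def backward(self, dout):
        # Local derivatives dz/dx = y and dz/dy = x, each
        # multiplied by the upstream gradient (chain rule).
        dx = self.y * dout
        dy = self.x * dout
        return dx, dy


class AddGate:
    def forward(self, x, y):
        return x + y

    def backward(self, dout):
        # dz/dx = dz/dy = 1, so the add gate simply
        # distributes the upstream gradient to both inputs.
        return dout, dout
```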

Step by step example

With what we have learned so far, let's calculate the partial derivatives of some graphs.

Simple example

Here we have a graph for the function f(x,y,z) = (x+y) \cdot z
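As a sanity check, here is a minimal sketch of the forward and backward passes through this graph (the input values x = -2, y = 5, z = -4 are illustrative):

```python
# Forward pass: q = x + y, then f = q * z.
x, y, z = -2.0, 5.0, -4.0  # illustrative values
q = x + y                  # 3.0
f = q * z                  # -12.0

# Backward pass: start with df/df = 1 and apply the chain rule.
df = 1.0
dq = z * df                # df/dq = z        -> -4.0
dz = q * df                # df/dz = q        ->  3.0
dx = 1.0 * dq              # dq/dx = 1, chain -> -4.0
dy = 1.0 * dq              # dq/dy = 1, chain -> -4.0
```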

Perceptron with 2 inputs

The following graph represents the forward propagation of a simple 2-input neural network with one output layer and sigmoid activation. In the graph the sigmoid is broken into its primitive nodes (*-1, exp, +1, and 1/x), which the steps below walk through one at a time.

\huge f(w,x)=\frac{1}{1+e^{-(w_0x_0+w_1x_1+w_2)}}

  1. Start from the output node, considering that our error (dout) is 1.

  2. The gradient of the input of the 1/x node will be -1/(1.37^2), or -0.53.

  3. The increment (+1) node does not change the gradient of its input, so it will be (-0.53 * 1), or -0.53.

  4. The exp node's input gradient will be (exp(-1) (its cached input) * -0.53), or -0.2.

  5. The negative gain (*-1) node's input gradient will be (-1 * -0.2), or 0.2.

  6. The sum node distributes the gradient to its inputs, so dw2 = 0.2 and the other branch also receives 0.2.

  7. The next sum node again distributes the gradient, so each multiply node receives 0.2.

  8. dw0 will be (0.2 * -1), or -0.2.

  9. dx0 will be (0.2 * 2), or 0.4; the complete pass is replicated in the code sketch below.
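The whole walkthrough can be replicated in a few lines of Python. Steps 8 and 9 pin down x0 = -1 and w0 = 2; the remaining inputs w1 = -3, x1 = -2, w2 = -3 are assumptions chosen so that the pre-activation sum equals 1, as step 2 implies:

```python
import math

# Assumed inputs: x0 and w0 follow from steps 8 and 9; w1, x1, w2 are
# illustrative values consistent with w0*x0 + w1*x1 + w2 = 1 (step 2).
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0

# Forward pass, one primitive node at a time.
s = w0 * x0 + w1 * x1 + w2  # 1.0
neg = -1.0 * s              # -1.0 (the *-1 node)
e = math.exp(neg)           # 0.37 (the exp node)
inc = 1.0 + e               # 1.37 (the +1 node)
f = 1.0 / inc               # 0.73 (the 1/x node)

# Backward pass, applying each local derivative in reverse order.
dout = 1.0                       # step 1
dinc = (-1.0 / inc ** 2) * dout  # step 2: -0.53
de = 1.0 * dinc                  # step 3: -0.53
dneg = math.exp(neg) * de        # step 4: -0.20
ds = -1.0 * dneg                 # step 5:  0.20
dw2 = ds                         # step 6:  0.20
dw0, dx0 = ds * x0, ds * w0      # steps 8-9: -0.20, 0.40
dw1, dx1 = ds * x1, ds * w1      # -0.39, -0.59
```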

Next Chapter

In the next chapter we will learn about Feature Scaling.
