Backpropagation


Introduction

Backpropagation is an algorithm that calculates the partial derivative of the loss with respect to every node of your model (e.g. a ConvNet or a neural network). Those partial derivatives are used during the training phase of your model, where a loss function states how far you are from the correct result. This error is propagated backward from the model output back to its first layers. Backpropagation is much easier to implement if you structure your model as a computational graph.

The most important thing to keep in mind here is how to calculate the forward propagation of each block and its gradient. In fact, most of the code in deep learning libraries is devoted to implementing the forward/backward passes of these gates.

Basic blocks

Some examples of basic blocks are add, multiply, exp, and max. All we need to do is observe their forward and backward calculations.

Some other derivatives

$$\begin{aligned}
f(x) &= \frac{1}{x} &\rightarrow\quad \frac{df}{dx} &= -\frac{1}{x^2} \\
f_c(x) &= c + x &\rightarrow\quad \frac{df}{dx} &= 1 \\
f(x) &= e^x &\rightarrow\quad \frac{df}{dx} &= e^x \\
f_a(x) &= ax &\rightarrow\quad \frac{df}{dx} &= a
\end{aligned}$$
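If one of these local derivatives ever looks suspicious, a quick numerical check is a handy sanity test. Below is a minimal sketch (plain Python, names purely illustrative) comparing the analytic derivative of f(x) = 1/x from the table above against a centred finite difference.

```python
def numerical_derivative(f, x, h=1e-5):
    # Centred finite difference: (f(x+h) - f(x-h)) / (2h)
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: 1.0 / x
analytic = lambda x: -1.0 / x ** 2  # df/dx = -1/x^2, from the table above

x = 3.0
print(numerical_derivative(f, x))  # approximately -0.1111
print(analytic(x))                 # -0.1111...
```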

Observe that we output two gradients because we have two inputs. Also observe that we need to save (cache) the inputs in memory during the forward pass, because they are needed to compute the gradients on the backward pass.

Chain Rule

Imagine that you have an output y, which is a function of g, which is a function of f, which is a function of x. If you want to know how much y will change with a small change in x (dy/dx), we use the chain rule. The chain rule is a formula for computing the derivative of the composition of two or more functions.
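Written out for the composition just described (y depends on g, g depends on f, and f depends on x), the chain rule simply multiplies the local derivatives along the path from x to y:

$$\frac{dy}{dx} = \frac{dy}{dg}\cdot\frac{dg}{df}\cdot\frac{df}{dx}$$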

The chain rule is the workhorse of back-propagation, so it's important to understand it now. In the picture below we have a node f(x,y) that computes some function of its two inputs x, y and outputs z. On the right side, this same node receives from somewhere ahead of it (e.g. the loss function) a gradient dL/dz, which means "how much L will change with a small change in z". As the node has two inputs it will have two gradients: one showing how L will change with a small change in x (dL/dx) and the other showing how L will change with a small change in y (dL/dy).

In order to calculate those gradients we need the incoming gradient dL/dz (dout) and the derivative of the function f(x,y) at that particular input, then we just multiply them. We also need the previous input, cached during forward propagation.

Gates Implementation

Below is an implementation of the multiply and add gates in Python.
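A minimal sketch, assuming each gate caches its inputs during the forward pass so that the backward pass can reuse them (class names here are illustrative):

```python
class MultiplyGate:
    def forward(self, x, y):
        # Cache the inputs; the backward pass needs them.
        self.x, self.y = x, y
        return x * y

    def backward(self, dout):
        # d(x*y)/dx = y and d(x*y)/dy = x, each scaled by the incoming gradient.
        dx = self.y * dout
        dy = self.x * dout
        return dx, dy


class AddGate:
    def forward(self, x, y):
        self.x, self.y = x, y
        return x + y

    def backward(self, dout):
        # d(x+y)/dx = d(x+y)/dy = 1, so the add gate just distributes dout.
        return dout, dout
```

Note how the multiply gate "swaps" its cached inputs when passing the gradient back, while the add gate simply routes the incoming gradient unchanged to both of its inputs.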

Step by step example

With what we have learned so far, let's calculate the partial derivatives of some graphs.

Simple example

Here we have a graph for the function f(x, y, z) = (x + y) · z, where q = x + y is the intermediate sum node:

  1. Start from the output node f, and consider that its gradient with respect to some criterion is 1 (dout).

  2. dq = dout(1) * z, which is -4 (how the output will change with a change in q).

  3. dz = dout(1) * q, which is 3 (how the output will change with a change in z).

  4. The sum gate distributes its input gradient, so dx = -4 and dy = -4 (how the output will change with x and y).
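Chaining the gate sketches from above, a possible numeric run of this graph looks like the following. The values x = -2 and y = 5 are assumed here for illustration; z = -4 and q = x + y = 3 are the values implied by the gradients dq = -4 and dz = 3 listed above.

```python
# Uses the MultiplyGate / AddGate sketches defined earlier.
x, y, z = -2.0, 5.0, -4.0   # x and y are assumed; z is implied by dq = -4

add_gate, mul_gate = AddGate(), MultiplyGate()

# Forward pass: q = x + y, then f = q * z
q = add_gate.forward(x, y)     # 3.0
f = mul_gate.forward(q, z)     # -12.0

# Backward pass, starting with dout = 1 at the output node.
dq, dz = mul_gate.backward(1.0)   # dq = z = -4.0, dz = q = 3.0
dx, dy = add_gate.backward(dq)    # the add gate distributes dq: dx = dy = -4.0
print(dx, dy, dz)                 # -4.0 -4.0 3.0
```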

Perceptron with 2 inputs

The following graph represents the forward propagation of a simple neural network with two inputs and one output layer with sigmoid activation.

$$f(w,x) = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$$
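As a cross-check for the walkthrough below, note that the whole chain of negation, exp, +1 and 1/x gates is just the sigmoid function σ, whose derivative has the compact form

$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad \frac{d\sigma}{dx} = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$

so the gradient arriving at the linear part w0x0 + w1x1 + w2 should be σ(1)·(1 − σ(1)) ≈ 0.73 · 0.27 ≈ 0.2, which matches the 0.2 seen from step 5 onward.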

  1. Start from the output node, considering that our error (dout) is 1.

  2. The gradient at the input of the 1/x node will be -1/(1.37^2) = -0.53.

  3. The increment (+1) node does not change the gradient on its input, so it will be (-0.53 * 1) = -0.53.

  4. The exp node's input gradient will be exp(-1) (where -1 is its cached input) * -0.53 = -0.2.

  5. The negative gain node's input gradient will be (-1 * -0.2) = 0.2.

  6. The sum node distributes the gradient to its inputs, so dw2 = 0.2 and the other branch also receives 0.2.

  7. The next sum node again distributes the gradient, so each multiplication branch receives 0.2.

  8. dw0 will be (0.2 * -1), i.e. 0.2 times the cached input x0, giving -0.2.

  9. dx0 will be (0.2 * 2), i.e. 0.2 times the cached input w0, giving 0.4.
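Putting numbers to the whole walkthrough, here is a small standalone sketch. The values w0 = 2 and x0 = -1 follow from steps 8-9; w1 = -3, x1 = -2 and w2 = -3 are assumed values chosen so that the linear output is 1 (giving the 1.37 = 1 + e^{-1} seen in step 2).

```python
import math

# w0 and x0 follow from steps 8-9; the remaining values are assumed (see text).
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0

# Forward pass, gate by gate (each intermediate value is the "cached input"
# of the following gate).
s   = w0 * x0 + w1 * x1 + w2   #  1.0  (linear part)
neg = -s                       # -1.0
e   = math.exp(neg)            #  0.37
den = 1.0 + e                  #  1.37
out = 1.0 / den                #  0.73

# Backward pass, starting with dout = 1 at the output.
dden = -1.0 / den ** 2         # step 2: -0.53
de   = 1.0 * dden              # step 3: -0.53 (the +1 gate passes it through)
dneg = math.exp(neg) * de      # step 4: -0.20
ds   = -1.0 * dneg             # step 5:  0.20
dw2  = ds                      # step 6:  0.20
dw0, dx0 = ds * x0, ds * w0    # steps 8-9: -0.20 and 0.40
dw1, dx1 = ds * x1, ds * w1    # same rule: -0.40 and -0.60
print(dw0, dx0, dw1, dx1, dw2)
```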

Next Chapter

In the next chapter we will learn about Feature Scaling.