Model Initialization


Introduction

One important topic we should cover before training our models is weight initialization. Bad weight initialization can lead to training that never converges, or to very slow training.

Weight matrix format

As observed in previous chapters, the weight matrix has the following format:

$$\underbrace{\begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{bmatrix}}_{\text{One column per x dimension}} . \begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \end{bmatrix} + \begin{bmatrix} b_{1} \\ b_{2} \end{bmatrix} = \begin{bmatrix} y_{1} \\ y_{2} \end{bmatrix} \therefore H(X) = (W.x) + b$$

Consider the number of outputs $fan_{out}$ as rows and the number of inputs $fan_{in}$ as columns. You could also consider another format:

$$\left( \begin{bmatrix} x_{1} & x_{2} & x_{3} \end{bmatrix} . \begin{bmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \\ w_{13} & w_{23} \end{bmatrix} \right) + \begin{bmatrix} b_{1} & b_{2} \end{bmatrix} = \begin{bmatrix} y_{1} & y_{2} \end{bmatrix} \therefore H(X) = (x.W^T) + b$$

Here $fan_{out}$ appears as columns and $fan_{in}$ as rows.

The whole point is that our weights form a 2D matrix whose shape is a function of $fan_{in}$ and $fan_{out}$.
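
To make the two layouts concrete, here is a minimal numpy sketch (the sizes and values are made up for illustration) showing that both conventions compute the same $H(X)$:

```python
import numpy as np

# Made-up example: 3 inputs (fan_in) and 2 outputs (fan_out)
x = np.array([1.0, 2.0, 3.0])             # input vector x1, x2, x3
b = np.array([0.5, -0.5])                 # bias vector b1, b2

# Format 1: W has shape (fan_out, fan_in), one column per x dimension
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])
y_format1 = W.dot(x) + b                  # H(X) = (W.x) + b

# Format 2: store the transposed weights, shape (fan_in, fan_out)
Wt = W.T
y_format2 = x.dot(Wt) + b                 # H(X) = (x.W^T) + b

print(np.allclose(y_format1, y_format2))  # True: both layouts give the same output
```

In practice you just need to pick one convention and use it consistently in the forward and backward passes.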

Initialize all to zero

If you initialize your weights to zero, gradient descent will never converge: every neuron in a layer computes the same output and receives the same gradient update, so the symmetry between them is never broken.
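
As a quick illustration (a minimal sketch with made-up sizes and data), take a two-layer ReLU network with every weight set to zero: all activations, and therefore all weight gradients, come out exactly zero, so the weights never move.

```python
import numpy as np

# Minimal sketch: a 2-layer ReLU network with all-zero weights.
np.random.seed(0)
x = np.random.randn(5, 3)            # 5 samples, 3 features
t = np.random.randn(5, 2)            # 5 regression targets, 2 outputs

W1 = np.zeros((3, 4))                # hidden weights, all zero
W2 = np.zeros((4, 2))                # output weights, all zero

h = np.maximum(0, x.dot(W1))         # hidden ReLU activations -> all zeros
y = h.dot(W2)                        # network outputs -> all zeros

dy = (y - t) / x.shape[0]            # gradient of a mean squared error loss
dW2 = h.T.dot(dy)                    # zero, because h is all zeros
dh = dy.dot(W2.T)                    # zero, because W2 is all zeros
dW1 = x.T.dot(dh * (h > 0))          # zero as well

print(np.abs(dW1).max(), np.abs(dW2).max())  # 0.0 0.0 -> the weights never update
```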

Initialize with small values

A better idea is to initialize your weights with values close to zero (but not zero), e.g. scaled by 0.01:

$$W = 0.01 * np.random.randn(fan_{in}, fan_{out})$$

Here randn gives samples with zero mean and unit standard deviation; $fan_{in}$ and $fan_{out}$ are the number of inputs and outputs. The 0.01 factor keeps the random weights small and close to zero.

The problem with this initialization is that the variance of the outputs grows with the number of inputs. To solve this issue, we can divide the random values by the square root of the number of inputs:

$$W = np.random.randn(fan_{in}, fan_{out}) / np.sqrt(fan_{in})$$

With this scaling the activations keep a healthy spread and we avoid dead neurons; the only remaining problem is using this scheme with ReLU neurons, which zero out the negative half of their inputs and therefore halve the variance again.
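
A quick numerical check of these claims (the layer sizes are arbitrary): with the 0.01 initialization the spread of the outputs grows roughly like $\sqrt{fan_{in}}$, while dividing by $\sqrt{fan_{in}}$ keeps it close to 1 no matter how wide the layer is.

```python
import numpy as np

# Sketch comparing output spread for the two initializations (arbitrary sizes).
np.random.seed(0)
fan_out = 512
for fan_in in (10, 100, 1000, 10000):
    x = np.random.randn(fan_in)                                    # unit-variance input
    W_small  = 0.01 * np.random.randn(fan_in, fan_out)             # "small values" init
    W_scaled = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)  # scaled init
    print(fan_in,
          round(np.std(x.dot(W_small)), 3),     # grows roughly like 0.01 * sqrt(fan_in)
          round(np.std(x.dot(W_scaled)), 3))    # stays close to 1
```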

To compensate for the ReLU halving, just divide $fan_{in}$ by 2:

$$W = np.random.randn(fan_{in}, fan_{out}) / np.sqrt(fan_{in}/2)$$

So use this second form (the He initialization) to initialize ReLU layers.
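
To see why the extra factor of 2 matters, here is a small sketch (depth and widths are made up) that pushes data through a stack of ReLU layers and prints the spread of the final activations for both scalings:

```python
import numpy as np

def final_std(relu_correction, depth=10, width=500, seed=0):
    """Spread of the activations after `depth` ReLU layers of size `width`."""
    rng = np.random.RandomState(seed)
    h = rng.randn(1000, width)                       # batch of unit-variance inputs
    fan_in = width
    den = fan_in / 2 if relu_correction else fan_in  # divide fan_in by 2 for ReLU
    for _ in range(depth):
        W = rng.randn(width, width) / np.sqrt(den)
        h = np.maximum(0, h.dot(W))                  # ReLU layer
    return np.std(h)

print(final_std(relu_correction=False))   # shrinks layer after layer (vanishing activations)
print(final_std(relu_correction=True))    # stays roughly constant through the stack
```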

Batch Norm layer

In a future chapter we will learn a technique, batch normalization, that makes your model more resilient to the choice of initialization.

Next chapter

In the next chapter we start talking about convolutions.