Model Initialization

Introduction

One important topic we should cover before we start training our models is weight initialization. Bad weight initialization can lead to training that never converges, or that converges very slowly.

Weight matrix format

As observed in previous chapters, the weight matrix has the following format:

$$\underbrace{\begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{bmatrix}}_{\text{One column per x dimension}} \cdot \begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \end{bmatrix} + \begin{bmatrix} b_{1} \\ b_{2} \end{bmatrix} = \begin{bmatrix} y_{1} \\ y_{2} \end{bmatrix} \therefore H(X) = (W \cdot x) + b$$

Consider the number of outputs $fan_{out}$ as the rows and the number of inputs $fan_{in}$ as the columns. You could also consider another format:

$$\left( \begin{bmatrix} x_{1} & x_{2} & x_{3} \end{bmatrix} \cdot \begin{bmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \\ w_{13} & w_{23} \end{bmatrix} \right) + \begin{bmatrix} b_{1} & b_{2} \end{bmatrix} = \begin{bmatrix} y_{1} & y_{2} \end{bmatrix} \therefore H(X) = (x \cdot W^T) + b$$

Here $fan_{out}$ appears as the columns and $fan_{in}$ as the rows.

The whole point is that our weights form a 2D matrix whose shape is a function of $fan_{in}$ and $fan_{out}$.
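
As a quick sanity check, here is a minimal NumPy sketch of both layouts (the 2x3 shapes mirror the example above; the values are random placeholders):

```python
import numpy as np

fan_in, fan_out = 3, 2

# Format 1: W has shape (fan_out, fan_in); x and b are column vectors
W = np.random.randn(fan_out, fan_in)   # (2, 3)
x = np.random.randn(fan_in, 1)         # (3, 1)
b = np.random.randn(fan_out, 1)        # (2, 1)
y_col = W.dot(x) + b                   # (2, 1), H(X) = (W.x) + b

# Format 2: x and b are row vectors, and we multiply by W transposed
y_row = x.T.dot(W.T) + b.T             # (1, 2), H(X) = (x.W^T) + b

print(np.allclose(y_col.T, y_row))     # True: both layouts give the same numbers
```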

Initialize all to zero

If you initialize your weights to zero, gradient descent will never converge: every neuron in a layer computes exactly the same output and receives exactly the same gradient, so the neurons never break symmetry (and in a network with hidden layers the weight gradients are themselves zero).
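
To see why, here is a tiny sketch with a hypothetical 2-layer tanh network and a squared-error loss (the shapes and the loss are just illustrative): with all-zero weights every hidden unit computes exactly the same value, and the gradients for both weight matrices come out as zero, so the weights never move.

```python
import numpy as np

# Hypothetical 2-layer network with every weight initialized to zero
x = np.random.randn(3, 1)                # one input sample
W1, b1 = np.zeros((4, 3)), np.zeros((4, 1))
W2, b2 = np.zeros((1, 4)), np.zeros((1, 1))

# Forward pass
h = np.tanh(W1.dot(x) + b1)              # hidden activations: all exactly 0
y = W2.dot(h) + b2                       # output depends only on b2

# Backward pass for a squared-error loss against some target t
t = 1.0
dy = y - t                               # gradient at the output
dW2 = dy.dot(h.T)                        # = 0, because h is all zeros
dh = W2.T.dot(dy) * (1 - h ** 2)         # = 0, because W2 is all zeros
dW1 = dh.dot(x.T)                        # = 0

print(h.ravel())                         # [0. 0. 0. 0.] -> every hidden unit is identical
print(dW1.any() or dW2.any())            # False -> no weight gradient, nothing is learned
```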

Initialize with small values

A better idea is to initialize your weights with small values close to zero (but not zero), e.g. scaled by 0.01:

$$W = 0.01 \cdot \text{np.random.randn}(fan_{in}, fan_{out})$$

Here randn draws samples from a Gaussian distribution with zero mean and unit standard deviation, and $fan_{in}$, $fan_{out}$ are the number of inputs and outputs. The 0.01 factor keeps the random weights small and close to zero.

The problem with this initialization is that the variance of the outputs grows with the number of inputs. To solve this issue, we can divide the random values by the square root of the number of inputs:

$$W = \text{np.random.randn}(fan_{in}, fan_{out}) \,/\, \text{np.sqrt}(fan_{in})$$
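
A quick sketch of the effect (the layer sizes are arbitrary, chosen only to show the trend): with the 0.01 initialization the output standard deviation grows like $\sqrt{fan_{in}}$, while dividing by $\sqrt{fan_{in}}$ keeps it roughly constant.

```python
import numpy as np

fan_out = 50
for fan_in in (100, 1000, 10000):                      # growing number of inputs
    x = np.random.randn(fan_in, 1)                     # unit-variance input

    # Small-value initialization: output std grows like 0.01 * sqrt(fan_in)
    W = 0.01 * np.random.randn(fan_out, fan_in)
    std_naive = W.dot(x).std()

    # Scaling by 1/sqrt(fan_in): output std stays roughly constant (~1)
    W = np.random.randn(fan_out, fan_in) / np.sqrt(fan_in)
    std_scaled = W.dot(x).std()

    print(fan_in, round(std_naive, 3), round(std_scaled, 3))
```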

Now the neurons no longer saturate or die out. The only remaining problem is using this approach with ReLU neurons: ReLU zeroes out half of its input distribution, which cuts the output variance in half.

To solve this, just divide $fan_{in}$ by 2 inside the square root:

$$W = \text{np.random.randn}(fan_{in}, fan_{out}) \,/\, \text{np.sqrt}(fan_{in}/2)$$

So use this last form to initialize ReLU layers.
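
Here is a sketch comparing the two scalings across a stack of ReLU layers (the helper name `deep_relu_stats`, the depth and the width are all just illustrative choices): without the divide-by-2 correction the activations shrink layer after layer, while with it their scale stays roughly constant.

```python
import numpy as np

def deep_relu_stats(scale_fn, n_layers=10, width=512):
    """Return the std of the activations after each ReLU layer."""
    x = np.random.randn(width, 1000)                 # batch of unit-variance inputs
    stds = []
    for _ in range(n_layers):
        W = np.random.randn(width, width) * scale_fn(width)
        x = np.maximum(0, W.dot(x))                  # ReLU
        stds.append(x.std())
    return stds

# 1/sqrt(fan_in): the activation std shrinks by roughly 1/sqrt(2) per layer
print(deep_relu_stats(lambda fan_in: 1.0 / np.sqrt(fan_in)))

# 1/sqrt(fan_in / 2): the activation std stays roughly constant
print(deep_relu_stats(lambda fan_in: 1.0 / np.sqrt(fan_in / 2.0)))
```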

Batch Norm layer

In future chapters we're going to learn a technique, batch normalization, that makes your model more resilient to the specific choice of initialization.

Next chapter

In the next chapter we start talking about convolutions.
