Model Initialization

Introduction

One important topic we should cover before we start training our models is weight initialization. Bad weight initialization can lead to training that never converges, or that converges very slowly.

Weight matrix format

As observed in previous chapters, the weight matrix has the following format:

$$\underbrace{\begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{bmatrix}}_{\text{One column per x dimension}} \cdot \begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \end{bmatrix} + \begin{bmatrix} b_{1} \\ b_{2} \end{bmatrix} = \begin{bmatrix} y_{1} \\ y_{2} \end{bmatrix} \therefore H(X) = (W \cdot x) + b$$

Consider the number of outputs $fan_{out}$ as the rows and the number of inputs $fan_{in}$ as the columns. You could also consider another format:

$$\left( \begin{bmatrix} x_{1} & x_{2} & x_{3} \end{bmatrix} \cdot \begin{bmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \\ w_{13} & w_{23} \end{bmatrix} \right) + \begin{bmatrix} b_{1} & b_{2} \end{bmatrix} = \begin{bmatrix} y_{1} & y_{2} \end{bmatrix} \therefore H(X) = (x \cdot W^T) + b$$

Here $fan_{out}$ appears as the columns and $fan_{in}$ as the rows.

The whole point is that our weights form a 2D matrix whose shape is a function of $fan_{in}$ and $fan_{out}$.
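
As a quick sanity check, here is a minimal NumPy sketch of both layouts (the 2x3 shapes mirror the example above; the values are random placeholders):

```python
import numpy as np

fan_in, fan_out = 3, 2

# Format 1: W has shape (fan_out, fan_in); x and b are column vectors
W = np.random.randn(fan_out, fan_in)   # (2, 3)
x = np.random.randn(fan_in, 1)         # (3, 1)
b = np.random.randn(fan_out, 1)        # (2, 1)
y_col = W.dot(x) + b                   # (2, 1), H(X) = (W.x) + b

# Format 2: x and b are row vectors, and we multiply by W transposed
y_row = x.T.dot(W.T) + b.T             # (1, 2), H(X) = (x.W^T) + b

print(np.allclose(y_col.T, y_row))     # True: both layouts give the same numbers
```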

Initialize all to zero

If you initialize your weights to zero, gradient descent will never converge: every neuron in a layer computes exactly the same output and receives exactly the same gradient, so the neurons never break symmetry (and in a network with hidden layers the weight gradients are themselves zero).
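
To see why, here is a tiny sketch with a hypothetical 2-layer tanh network and a squared-error loss (the shapes and the loss are just illustrative): with all-zero weights every hidden unit computes exactly the same value, and the gradients for both weight matrices come out as zero, so the weights never move.

```python
import numpy as np

# Hypothetical 2-layer network with every weight initialized to zero
x = np.random.randn(3, 1)                # one input sample
W1, b1 = np.zeros((4, 3)), np.zeros((4, 1))
W2, b2 = np.zeros((1, 4)), np.zeros((1, 1))

# Forward pass
h = np.tanh(W1.dot(x) + b1)              # hidden activations: all exactly 0
y = W2.dot(h) + b2                       # output depends only on b2

# Backward pass for a squared-error loss against some target t
t = 1.0
dy = y - t                               # gradient at the output
dW2 = dy.dot(h.T)                        # = 0, because h is all zeros
dh = W2.T.dot(dy) * (1 - h ** 2)         # = 0, because W2 is all zeros
dW1 = dh.dot(x.T)                        # = 0

print(h.ravel())                         # [0. 0. 0. 0.] -> every hidden unit is identical
print(dW1.any() or dW2.any())            # False -> no weight gradient, nothing is learned
```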

Initialize with small values

A better idea is to initialize your weights with small values close to zero (but not zero), e.g. scaled by 0.01:

$$W = 0.01 \cdot \text{np.random.randn}(fan_{in}, fan_{out})$$

Here randn draws samples from a Gaussian distribution with zero mean and unit standard deviation, and $fan_{in}$, $fan_{out}$ are the number of inputs and outputs. The 0.01 factor keeps the random weights small and close to zero.

The problem with this initialization is that the variance of the outputs grows with the number of inputs. To solve this issue, we can divide the random values by the square root of the number of inputs:

$$W = \text{np.random.randn}(fan_{in}, fan_{out}) \,/\, \text{np.sqrt}(fan_{in})$$
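
A quick sketch of the effect (the layer sizes are arbitrary, chosen only to show the trend): with the 0.01 initialization the output standard deviation grows like $\sqrt{fan_{in}}$, while dividing by $\sqrt{fan_{in}}$ keeps it roughly constant.

```python
import numpy as np

fan_out = 50
for fan_in in (100, 1000, 10000):                      # growing number of inputs
    x = np.random.randn(fan_in, 1)                     # unit-variance input

    # Small-value initialization: output std grows like 0.01 * sqrt(fan_in)
    W = 0.01 * np.random.randn(fan_out, fan_in)
    std_naive = W.dot(x).std()

    # Scaling by 1/sqrt(fan_in): output std stays roughly constant (~1)
    W = np.random.randn(fan_out, fan_in) / np.sqrt(fan_in)
    std_scaled = W.dot(x).std()

    print(fan_in, round(std_naive, 3), round(std_scaled, 3))
```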

Now the neurons no longer saturate or die out. The only remaining problem is using this approach with ReLU neurons: ReLU zeroes out half of its input distribution, which cuts the output variance in half.

To solve this, just divide $fan_{in}$ by 2 inside the square root:

$$W = \text{np.random.randn}(fan_{in}, fan_{out}) \,/\, \text{np.sqrt}(fan_{in}/2)$$

So use this last form to initialize ReLU layers.
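
Here is a sketch comparing the two scalings across a stack of ReLU layers (the helper name `deep_relu_stats`, the depth and the width are all just illustrative choices): without the divide-by-2 correction the activations shrink layer after layer, while with it their scale stays roughly constant.

```python
import numpy as np

def deep_relu_stats(scale_fn, n_layers=10, width=512):
    """Return the std of the activations after each ReLU layer."""
    x = np.random.randn(width, 1000)                 # batch of unit-variance inputs
    stds = []
    for _ in range(n_layers):
        W = np.random.randn(width, width) * scale_fn(width)
        x = np.maximum(0, W.dot(x))                  # ReLU
        stds.append(x.std())
    return stds

# 1/sqrt(fan_in): the activation std shrinks by roughly 1/sqrt(2) per layer
print(deep_relu_stats(lambda fan_in: 1.0 / np.sqrt(fan_in)))

# 1/sqrt(fan_in / 2): the activation std stays roughly constant
print(deep_relu_stats(lambda fan_in: 1.0 / np.sqrt(fan_in / 2.0)))
```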

Batch Norm layer

In future chapters we're going to learn a technique, batch normalization, that makes your model more resilient to the specific choice of initialization.

Next chapter

In the next chapter we start talking about convolutions.
