Model Initialization
One important topic we should cover before we start training our models is weight initialization. Bad weight initialization can lead to training that never converges, or to slow training.
As observed in previous chapters, the weight matrix has the following format:

$$W \in \mathbb{R}^{n_{out} \times n_{in}}$$

Consider the number of outputs $n_{out}$ as rows and the number of inputs $n_{in}$ as columns. You could also consider another format:

$$W \in \mathbb{R}^{n_{in} \times n_{out}}$$

Here $n_{out}$ appears as columns and $n_{in}$ as rows. The whole point is that our weights form a 2D matrix, a function of $n_{in}$ and $n_{out}$.
If you initialize your weights to zero, your gradient descent will never converge: every neuron in a layer computes the same output and receives the same gradient, so the neurons can never become different from one another.
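To see why, here is a minimal numpy sketch (the tiny layer sizes, the sigmoid nonlinearity, and the squared-error loss are assumptions for illustration). With all-zero weights, the gradient for the first layer is exactly zero and every entry of the second layer's gradient is identical, so training cannot break the symmetry:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny network: 3 inputs -> 4 hidden units -> 1 output, all weights zero
n_in, n_hidden, n_out = 3, 4, 1
W1 = np.zeros((n_hidden, n_in))
W2 = np.zeros((n_out, n_hidden))

x = np.random.randn(n_in, 1)      # one random input
t = np.array([[1.0]])             # arbitrary target

# Forward pass
h = sigmoid(W1 @ x)               # every hidden unit outputs exactly 0.5
y = W2 @ h

# Backward pass for a squared-error loss
dy = y - t
dW2 = dy @ h.T                    # the same value in every entry
dh = W2.T @ dy                    # zero, because W2 is zero
dW1 = (dh * h * (1.0 - h)) @ x.T  # all zeros: W1 never receives a signal

print(dW2)  # identical entries: hidden units stay indistinguishable
print(dW1)  # all zeros
```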
A better idea is to initialize your weights with small random values close to zero (but not zero), e.g., scaled by 0.01. Here randn gives random data with zero mean and unit standard deviation, $n_{in}$ and $n_{out}$ are the number of inputs and outputs, and the 0.01 factor keeps the random weights small and close to zero.
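A minimal numpy sketch of this initialization (the layer sizes are hypothetical):

```python
import numpy as np

n_in, n_out = 512, 256  # hypothetical layer sizes
# Zero-mean, unit-variance Gaussian scaled down by 0.01
W = 0.01 * np.random.randn(n_out, n_in)
```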
The problem with this way of doing initialization is that the variance of the outputs will grow with the number of inputs. To solve this issue, we can divide the random term by the square root of the number of inputs, as in the sketch below.
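A sketch of this scaling, with a quick check that the output variance stays roughly constant no matter how large $n_{in}$ is (the sizes are again illustrative):

```python
import numpy as np

n_in, n_out = 512, 256
# Scale by 1/sqrt(n_in) so the output variance does not grow with n_in
W = np.random.randn(n_out, n_in) / np.sqrt(n_in)

x = np.random.randn(n_in, 1000)   # 1000 random unit-variance input vectors
print(np.var(W @ x))              # stays close to 1.0 for any n_in
```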
Now it seems that we don't have dead neurons; the only problem with this approach appears when we use it with ReLU neurons, because the ReLU zeroes out roughly half of its inputs and cuts the output variance in half.
To solve this, just add a simple divide-by-2 term inside the square root, as in the sketch below.
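A sketch of the same check with the extra divide-by-2 term (this matches the initialization proposed by He et al. for ReLU layers; sizes are illustrative):

```python
import numpy as np

n_in, n_out = 512, 256            # hypothetical layer sizes
# Divide n_in by 2 inside the square root to compensate for the ReLU
# zeroing out (on average) half of its inputs
W = np.random.randn(n_out, n_in) / np.sqrt(n_in / 2.0)

x = np.random.randn(n_in, 1000)   # unit-variance random inputs
h = np.maximum(0, W @ x)          # ReLU activation
print(np.mean(h ** 2))            # stays close to 1.0 for any n_in
```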
So use this second form to initialize ReLU layers.
In future chapters we're going to learn a technique that makes your model more resilient to the specific initialization.
In the next chapter we start talking about convolutions.