One important topic we should cover before we start training our models is weight initialization. Bad weight initialization can lead to training that never converges, or to slow training.
Weight matrix format
As observed in previous chapters, the weight matrix has the following format:
$$\underbrace{\begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{bmatrix}}_{\text{one column per } x \text{ dimension}} \cdot \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \quad\therefore\quad H(X) = (W \cdot x) + b$$
Consider the number of outputs (fanout) as rows and the number of inputs (fanin) as columns. You could also consider the transposed format, with fanin as rows and fanout as columns, in which case the hypothesis becomes:

$$H(X) = (W^T \cdot x) + b$$
The whole point is that our weights form a 2D matrix whose shape is a function of fanin and fanout.
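As a concrete illustration, here is a minimal numpy sketch of the forward pass for the 3-input, 2-output example above, using the first format (fanout rows, fanin columns); the variable names are just the ones from the text:

```python
import numpy as np

fanin, fanout = 3, 2                 # 3 inputs (x1..x3), 2 outputs (y1..y2)
W = np.random.randn(fanout, fanin)   # one row per output, one column per input
b = np.random.randn(fanout, 1)
x = np.random.randn(fanin, 1)

y = W.dot(x) + b                     # H(X) = (W . x) + b
print(y.shape)                       # (2, 1)
```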
Initialize all to zero
If you initialize your weights to zero, gradient descent will never converge. With every weight at zero, all neurons in a layer compute the same output and receive the same gradient update, so the network can never break this symmetry and learn different features.
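To see why, consider a minimal sketch (assuming a tanh hidden layer, zero biases, and toy shapes chosen just for illustration): with everything at zero, every gradient computed by backpropagation is itself zero, so no weight is ever updated.

```python
import numpy as np

# Tiny 1-hidden-layer network with all weights initialized to zero
x = np.array([[1.0], [2.0]])       # input, shape (2, 1)
W1 = np.zeros((3, 2))              # hidden layer: 3 neurons
W2 = np.zeros((1, 3))              # output layer: 1 neuron

h = np.tanh(W1.dot(x))             # hidden activations: all zero
y = W2.dot(h)                      # output: zero

# Backprop for a squared-error loss against target t = 1
t = 1.0
dy = y - t                         # gradient at the output
dW2 = dy.dot(h.T)                  # zero, because h is zero
dh = W2.T.dot(dy) * (1 - h**2)     # zero, because W2 is zero
dW1 = dh.dot(x.T)                  # zero: no weight ever receives a gradient

print(dW1, dW2)                    # all zeros -- training never starts
```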
Initialize with small values
A better idea is to initialize your weights with small values close to zero (but not zero), e.g. 0.01:
```python
import numpy as np

W = 0.01 * np.random.randn(fanin, fanout)
```
Here randn draws samples from a standard normal distribution (zero mean, unit standard deviation), and fanin, fanout are the numbers of inputs and outputs. The 0.01 factor keeps the random weights small and close to zero.
The problem with the previous initialization is that the variance of a neuron's output grows with its number of inputs. To solve this issue we can divide the random term by the square root of the number of inputs (fanin).
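A quick numerical check of this claim (a sketch; the sizes fanin=512 and fanout=1000 are arbitrary example values):

```python
import numpy as np

fanin, fanout = 512, 1000
x = np.random.randn(1, fanin)      # one input sample with unit variance

W_small = 0.01 * np.random.randn(fanin, fanout)             # naive small weights
W_scaled = np.random.randn(fanin, fanout) / np.sqrt(fanin)  # scaled by sqrt(fanin)

print(x.dot(W_small).std())    # ~0.01 * sqrt(fanin): grows with the number of inputs
print(x.dot(W_scaled).std())   # ~1.0, independent of fanin
```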