Machine learning models learn by updating it's parameters (weights and biases) towards the direction of the correct classification.

On the picture bellow we will show the basic blocks of a machine learning model

On this picture we can detect the following components 1. Training dataset: Basically a high speed disk containing your training data 2. Batch of samples: A list of pairs (X,Y), consisting on inputs, expected outputs, for example X can be a image and Y the label "cat" 3. Parameters: Set of parameters used by your model layers, to map X to Y 4. Model: Set of computing layers that transform an input X and weights W, into a score (probable Y) 5. Loss function: Responsible to say how far our score is from the ideal response Y, the output of the loss function is a scalar. Another way is also to consider that the loss function say how bad is your current set of parameters W.

Basically we need an algorithm that will change our weight and biases, in order to minimize our loss function.

You can imagine the loss function here as a place with some mountains, and your objective is to find it's vale (lowest) place. Your only instrument is the gadget that returns your altitude (loss). You need to find out which direction to take.

Here we can also observe two things:

You have more than one vale (Local minima)

Depending where you landed you probably find one instead of the other (Importance of weight initialization)

We can use calculus to discover which direction to take, for instance if we follow the derivative of the loss function, we can guarantee that we always go down. This is done just by subtracting the current set of weights by derivative of the loss function evaluated at that point. $\large W_{new}=W_{old}-(\gamma.\nabla{_w}_L)$ On multiple dimensions we have a vector of partial derivatives, which we call gradient. Observe that we multiply the gradient by a factor $\gamma$ (step-size, learning-rate), that is normally a small value ex: 0.001

To illustrate the gradient descent method let's follow a simple case in 1D. Consider the initial weight (-1.5)

â€‹$\large L(w)=w^4-3w^3+2 \therefore W_{old}=-1.5\\\large\nabla{_L}=\frac{dL(w)}{dw}=4w^3-9w^2 \therefore \text{evaluate on current weight}\\ \large\frac{dL(-1.5)}{dw}=4(1.5)^3-9(-1.5)^2=-33.75 \Rightarrow \nabla{_L}_w \\\large \nabla{_L}_w=-33.75 \therefore\\\large W{new}=(-1.5)-(0.001.(-33.75)) \Rightarrow -1.49$â€‹

On Matlab

We can observe that after we use calculus to find the derivative of the loss function, we just needed to evaluate it with the current weight. After that we just take the evaluated value from the current weight.

Using a big learning rate can accelerate convergence, but could also make the gradient descent oscillate and even diverge.

Slow method to evaluate the gradient, but we can use it to verify if our code is right

Instead of running through all the training set to calculate the loss, then do gradient descent, we can do it in small (batch) portions. This leads to similar results while computing faster. Normally the mini-batch size is dependent on the size of GPU memory.

If you analyze your loss decay over time the full-batch version is less noisy (because we are choosing the best possible direction to descend each time we update weights) but it is much longer and difficult to compute.

At the other extreme if your mini-batch size goes to 1, which means you compute the gradient descent for each sample. In this case we have what is called Stochastic Gradient Descent. This is relatively less common to see because in practice due to vectorized code optimizations it can be computationally much more efficient to evaluate the gradient for 100 examples, than the gradient for one example 100 times. Also with using a bigger batch size we are more likely to be estimating the true gradient.

Actually what people do is to choose a high (but not so high) learning rate, then decay with time.

Here are 2 ways to decay the learning rate with time while training:

Divide the learning rate ($\alpha$) by 2 every x epochs

â€‹$\alpha = \alpha_{0}.e^{-k.t}$, where t is time and k is the decay parameter

â€‹$\alpha=\frac{\alpha_{0}}{1+k.t}$, where t is the time and k the decay parameter

An important thing to note is that learning rate scale proportionaly with batch size so if we increase our batch size by 2x we can double our learning rate.

This is because as the size of the batch increases we become more confident that the gradient we calculate is going in the correct direction for optimisation. As we are more confident the direction is correct we can safely take bigger steps in that direction without fear we will start to diverge.

Going the other way if our batch size is smaller we have to reduce learning rate as we arent completely confident in the direction of descent as we have less samples to work with so we must take small steps so any bad gradient updates dont take us too far off the correct path.

The gradient descent is the simplest idea to do model optimization. There are a few other nice algorithms to try when thinking about model optimization that are all based on gradient descent but with some extra little things added:

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent with momentum (Very popular)

Nestorov's accelerated gradient (NAG)

Adaptive graident (AdaGrad)

Adam (Very good because you need to take less care about learning rate)

RMSprop

Below is an animation showing how fast some of these different optimisers reacher the minimum of a function.

Now what we need is a way to get the gradient vector of our model, next chapter we will talk about back-propagation.