In this chapter we will learn about Supervised learning as well as talk a bit about Cost Functions and Gradient descent. Also we will learn about 2 simple algorithms:

Linear Regression (for Regression)

Logistic Regression (for Classification)

The first thing to learn about supervised learning is that every sample data point x has an expected output or label y, in other words your training is composed of $(x^{(i)}, y^{(i)})$ pairs.

For example consider the table below:

Size in feet (x) | Price (y) |

2104 | 460 |

1416 | 232 |

1534 | 315 |

852 | 178 |

This table (or training set) shows the sizes of houses along with their price. So a house of size 2104 feet costs 460.

The idea is that we could use data like this to create models that can predict some outcome (ie: Price) from different inputs (ie: Size in feet).

Regression is about returning a continuous scalar number from some input. The model or hypothesis that we will learn will predict a value following a linear rule. However sometimes a linear model is not enough to capture the underlying nature of your data.

Basically linear regression trys to create a line $f(x)=\beta + \alpha.x$, that fits the data on training, e.g.

$\Large h_{\theta_{i}}(x) = \theta_0 + \theta_1.x$

$\text{where } \theta_{i} \in \{\theta_{0}, \theta_{1}\}$

The whole idea of supervised learning is that we try to learn the best parameters (theta in this case) from our training set.

Before we talk about how to learn the parameters (also called weights) of our hypothesis we need to know how to evaluate if our current set of weights are already doing a good job. The function that does this job is called Loss or Cost function. Basically it will return a scalar value between 0(No error) and infinity (Really bad).

An example of such a function is given below:

$\Large J(\theta_{i})=\frac{1}{2.m}.\sum_{i=1}^{m} [h_{\theta_i}(x^{(i)})-y^{(i)}]^2$

where

m: Number of items in your dataset

(i): i-th element of your dataset

y: Label(expected value) in dataset

This particular cost function is called mean-squared error loss, and is actually very useful for regression problems.

During training we want to minimise our loss by continually changing the theta parameters. Another nice feature of this paticular function is that it is a convex function so it is guaranteed to have no more than one minimum, which will also be it's global minimum. This makes it easier for us to optimise.

Our task is therefor to find:

$\Large \text{min}_\theta J(\theta_{i})$

Gradient descent is a simple algorithm that will try to find the local minimum of a function. We use gradient descent to minimise our loss function. One important feature to observe on gradient descent is that it will more often than not get stuck into the first local minimum that it encounters. However there is no guarantee that this local minimum it finds is the best (global) one.

The gradient descent needs the first derivative of the function that you want to minimise. So if we want to minimise some function by changing parameters, you need to derive this function with respect to these parameters.

One thing to note is as this algorithm will execute on all samples in your training set it does not scale well for bigger datasets.

Bellow we have some simple implementation in matlab that uses gradient descent to minimise the following function:

$f(x)=x^4-3.x^3+2$

It's derivative with respect to x is:

$\frac{\partial f(x)}{\partial x}=4.x^3-9.x^2$

The code to find at what point our local minimum occurs is as follows:

% Some tests on Gradient descent%% Define parameters start at 3.5x_old=3.5; alpha=0.01; precision=0.0001;%% Define functionx_input = [-1:0.01:3.5];f = @(x) x.^4 - 3*x.^3 + 2;df = @(x) 4*x.^3 - 9*x.^2;y_output = f(x_input);plot(x_input, y_output);%% Gradient descent algorithm% Keep repeating until convergencewhile 1% Evalulate gradientstmpDelta = x_old - alpha*(df(x_old));% Check ConvergencediffOldTmp = abs(tmpDelta - x_old);if diffOldTmp < precisionbreak;end% Update parametersx_old = tmpDelta;endfprintf('The local minimum is at %d\n', x_old);

In order to use gradient descent for linear regression you need to calculate the derivative of it's loss (means squared error) with respect to it's parameters.

This derivative will be:

$\Large \frac{\partial J(\theta)}{\partial \theta}=\frac{1}{m}.\sum_{i=1}^{m} (h_{\theta_i}(x^{(i)})-y^{(i)}).x^{(i)}_j$

The name may sound confusing but actually Logistic Regression is simply all about classification. For instance consider the example below where we want to classify 2 classes: x's and o's. Now our output y will have two possible values [0,1].

Normally we do not use Logistic Regression if we have a large number of features (e.g. more than 100 features).

Our hypothesis is almost the same compared to the linear regression however the difference is that now we use a function that will force our output to give $y \in [0,1]$

$\Large h_{\theta}(x)=g(\theta^T.x)$

where

$g(z)=\frac{1}{1+e^-z}$ is the logistic or sigmoid function, therefore

$\Large h_{\theta}(x)=\frac{1}{1+e^{(\theta^T.x)}}$

Here the sigmoid function will convert a scalar number to some probability between 0 and 1.

When dealing with classification problems, we should use a different cost/loss function. A good candidate for classification is the cross-entropy cost function. By the way this function can be found by using the Maximum Likelihood estimation method.

The cross-entropy cost function is:

$\Large J(\theta)=\frac{1}{m}.\sum_{i=1}^{m}[y^{(i)}.log(h_{\theta}(x^{(i)}))+(1-y^{(i)}).log(1-h_{\theta}(x^{(i)}))]$

Again our objective is to find the best parameter theta that minimize this function, so to use gradient descent we need to calculate the derivative of this function with respect to theta

$\Large \frac{\partial J(\theta)}{\partial \theta}=\frac{1}{m}.\sum_{i=1}^{m} (h_{\theta_i}(x^{(i)})-y^{(i)}).x^{(i)}_j$

One cool thing to note is that the derivative is the same as the derivative from linear regression.

The main idea of training in machine learning is to let the computer learn the underlying structure of the training set and not just the specific training set that it sees. If it does not learn the underlying structure and instead learns just the structure of the training samples then we say our model has overfit.

We can tell overfitting has occured when our model has a very good accuracy on the training set (e.g.: 99.99%) but does not perform that well on the test set (e.g.: 60%).

This basically occurs when your model is too complex in relation to the available data, and/or you don't have enough data to capture the underlying data pattern.

The opposite of this problem is called Underfitting, basically it happens when your model is too simple to capture the concept given in the training set.

Some graphical examples of these problems are shown below:

**Left:** Underfit

**Middle: **Perfect

**Right:** Overfit

Get more data

Use regularisation

Check other model architectures

Add more layers or more parameters

Check other architectures

It's a method that helps overfitting by forcing your hypothesis parameters to have nice small values and to force your model to use all the available parameters. To use regularization you just need to add an extra term to your cost/loss function during training. Below we have an example of the Mean squared error loss function with an added regularization term.

$\Large J(\theta)= \dfrac{1}{2m}\ \left[ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\ \sum_{j=1}^n \theta_j^2 \right]$

Also the regularized version of the cross-entropy loss function

$\Large J(\theta) = - \frac{1}{m} \sum_{i=1}^m \large[ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)}))\large] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$

This term will basically multiply by $\lambda$ all parameters $\theta$ from your hypothesis

Normally we don't need to regularise the bias term of our hypothesis.

During gradient descent you also need to calculate the derivative of those terms.

You can imagine that besides forcing the weights to have lower values, the regularisation will also spread the "concept" across more weights. One way to think about what happens when we regularise is shown below.

For our input x we can have 2 weight vectors w1 or w2. When we multiply our input with our weight vector we will get the same output of 1 for each. However the weight vector w2 is a better choice of weight vector as this will look at more of our input x compared to w1 which only looks at one value of x.

$x = [1,1,1,1]\\w_1=[1,0,0,0]\\w_2=[0.25,0.25,0.25,0.25]\\\therefore w_1^T.x=w_2^T.x=1$

In practice this might reduce performance on our training set but will improve generalisation and thus performance when testing.

Sometimes our hypothesis needs to have non-linear terms to be able to predict more complex classification/regression problems. We can for instance use Logistic Regression with quadratic terms to capture this complexity. As mentioned earlier this does not scale well for large number of features.

For instance if we would like to classify a 100x100 grayscale image using logistic regression we would need 50 million parameters to include the quadratic terms. Training with this number of features will need at least 10x more images (500 millions) to avoid overfitting. Also the computational cost will be really high.

In the next chapters we will learn other algorithms that scale better for bigger number of features.

As we increase the number of parameters on our model our loss function will start to find multiple local minima. This can be a problem for training because the gradient descent can get stuck in one of them.

Actually some recent papers also try to prove that some of the methods used today (ie: Deep Neural networks) the local-minima is actually really close to the global minima, so you don't need to care about using a Convex loss or about getting stuck in a local-minima.