Neural Networks

A family of models that takes a very “loose” inspiration from the brain, used to approximate functions that depend on a large number of inputs. (It is a very good pattern-recognition model.)

Neural networks are examples of a non-linear hypothesis, where the model can learn to classify much more complex relations. They also scale better than logistic regression for a large number of features.

A neural network is formed by artificial neurons organised in layers. We have 3 types of layers:

  • Input layer

  • Hidden layers

  • Output layer

We classify neural networks by their number of hidden layers and how they are connected; for instance, the network above has 2 hidden layers. Also, depending on whether or not the network has loops, we can classify it as a recurrent or a feed-forward neural network.

Neural networks with more than 2 hidden layers can be considered deep neural networks. The advantage of deeper neural networks is that they can recognise more complex patterns.

Below we have an example of a feed-forward artificial neural network with 2 hidden layers. Imagine that the connections between neurons are the parameters that will be learned during training. In this example, layer L1 is the input layer, L2/L3 the hidden layers, and L4 the output layer.

Brain Differences

Now wait, before you start thinking that you can just create a huge neural network and call it strong AI, there are a few points to remember:

  • The artificial neuron fires totally differently from a biological neuron

  • A human brain has 100 billion neurons and 100 trillion connections (synapses) and operates on 20 watts (enough to run a dim light bulb); in comparison, the biggest neural networks have 10 million neurons and 1 billion connections, running on 16,000 CPUs (about 3 million watts)

  • The brain is limited to 5 types of input data from the 5 senses.

  • Children do not learn what a cow is by reviewing 100,000 pictures labelled “cow” and “not cow”, but this is how machine learning works.

  • We probably don't learn by calculating the partial derivative of each neuron with respect to our initial concept. (By the way, we don't really know how we learn)

Real Neuron

Artificial Neuron

A single artificial neuron does a dot product between w and x, then adds a bias; the result is passed to an activation function that adds some non-linearity. The neural network is formed by those artificial neurons.
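
As a minimal sketch of this computation (Matlab/Octave, with illustrative names; the sigmoid here is just one possible activation function):

% One artificial neuron: dot product, plus bias, through an activation.
function a = neuron(w, x, b)
  z = dot(w, x) + b;      % weighted sum of the inputs plus the bias
  a = 1 / (1 + exp(-z));  % sigmoid activation adds the non-linearity
end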

The non-linearity allows different variations of an object of the same class to be learned separately. This is a different behaviour compared to a linear classifier, which tries to learn all the variations of a class with a single set of weights. More neurons and more layers generally give better results, but they also need more data to train.

Each layer learns a concept from the output of its previous layer, so it's better to have a deeper neural network than a wide one. (It took around 20 years to discover this.)

Activation Functions

After the neuron computes the dot product between its inputs and weights, it also applies a non-linearity to the result. This non-linear function is called the activation function. In the past the popular choices for activation functions were the sigmoid and tanh; more recently it was observed that ReLU layers respond better in deep neural networks, because sigmoid and tanh suffer from a problem called the vanishing gradient. So you can consider using mostly ReLU neurons.

\text{sigmoid: } \sigma(x) = \frac{1}{1+e^{-x}}\\ \text{tanh: } \sigma(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}}\\ \text{ReLU: } \sigma(x) = \max(0,x)
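
These can be written directly as vectorized one-liners (a small Matlab/Octave sketch; tanh is also a built-in):

sigmoid = @(x) 1 ./ (1 + exp(-x));                        % squashes values into (0, 1)
tanh_fn = @(x) (exp(x) - exp(-x)) ./ (exp(x) + exp(-x));  % same as the built-in tanh
relu    = @(x) max(0, x);                                 % zeroes out negative values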

Example of simple network

Consider the neural network below with 1 hidden layer: 3 input neurons, 3 hidden neurons, and one output neuron.

We can define all the operations that this network performs as follows.

\begin{align*} a_1^{(2)} = g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3) \newline a_2^{(2)} = g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3) \newline a_3^{(2)} = g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3) \newline h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)}) \newline \end{align*}

Here a_1^{(2)} means the activation (output) of the first neuron of layer 2 (the hidden layer in this case). The first layer (the input layer) can be considered as a_n^{(1)}, and its values are just the input vector.

Consider the connections between each pair of layers as a matrix of parameters. In this case we have two matrices:

  • \Theta^{(1)}: maps layer 1 to layer 2 (input to hidden layer)

  • \Theta^{(2)}: maps layer 2 to layer 3 (hidden to output layer)

You can consider the dimensions of \Theta^{(1)} to be [number of neurons on layer 2] x [number of neurons on layer 1 + 1]. In other words: s_{j+1} \times (s_j + 1) = 3 \times 4, where:

  • s_{j+1}: Number of neurons on the next layer

  • s_j + 1: Number of neurons on the current layer, plus 1 for the bias

Notice that this is only true if we include the bias as part of our weight matrices; this can vary from implementation to implementation.

Why it is better than Logistic Regression

Consider the neural network as a cascaded chain of logistic regressions, where the input of each layer is the output of the previous one. Another way to think about this is that each layer learns a concept from the output of the previous layer.

This is nice because a layer does not need to learn the whole concept at once; instead the network builds a chain of features that together build that knowledge.

Vectorized Implementation

You can calculate the output of a whole layer a^{(n)} as a matrix multiplication followed by an element-wise activation function. This has performance advantages if you are using tools like Matlab or Numpy, and also if you are implementing it in hardware.

The mechanism of calculating the output of each layer from the output of the previous layer, from the beginning (input layer) to the end (output layer), is called forward propagation.

To understand this better, let's break down the activation of a layer as follows:

A^{(n)}=g(Z^{(n)})\\ Z^{(n)}=\Theta^{(n-1)} A^{(n-1)}

Using these formulas you can calculate the activation of each layer.
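
A minimal vectorized forward-propagation sketch (Matlab/Octave; it assumes the bias is folded into each \Theta matrix, as discussed above, and uses sigmoid activations):

% Thetas is a cell array {Theta1, Theta2, ...}; x is a column input vector.
function a = forward_propagation(Thetas, x)
  g = @(z) 1 ./ (1 + exp(-z));  % sigmoid activation
  a = x;
  for l = 1:numel(Thetas)
    a = [1; a];                 % prepend the bias unit a_0 = 1
    a = g(Thetas{l} * a);       % A^(l+1) = g(Theta^(l) * A^(l))
  end
end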

Multiple class problems

In a multi-class classification problem you need to allocate one output neuron for each class; then, during training, you provide a one-hot vector for each desired class. This is somewhat easier than the one-vs-all training used with logistic regression.
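
For example, a small sketch of building the one-hot target for class 3 out of K = 4 classes (Matlab/Octave):

K = 4; k = 3;     % number of classes, index of the desired class
y = zeros(K, 1);
y(k) = 1;         % y = [0; 0; 1; 0]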

Cost function (Classification)

The cost function of a neural network is a little more complicated than that of logistic regression. For classification with neural networks we should use:

\begin{gather*} J(\Theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2\end{gather*}

Where:

  • L: Number of layers

  • m: Dataset size

  • K: Number of classes

  • s_l: Number of neurons (not counting the bias) in layer l
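
A minimal sketch of this cost in Matlab/Octave (names are illustrative: H holds the network outputs for all m examples, Y the one-hot labels, and Thetas the weight matrices with the bias in the first column):

% Regularized cross-entropy cost for a neural network classifier.
function J = nn_cost(H, Y, Thetas, lambda)
  m = size(Y, 2);  % number of examples
  J = -(1/m) * sum(sum(Y .* log(H) + (1 - Y) .* log(1 - H)));
  for l = 1:numel(Thetas)
    T = Thetas{l}(:, 2:end);  % exclude the bias column from regularization
    J = J + (lambda/(2*m)) * sum(sum(T .^ 2));
  end
end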

During training we need to calculate the partial derivative of this cost function with respect to each parameter of the neural network. What we actually need to compute is:

  • The loss itself (Forward-propagation)

  • The derivative of the loss w.r.t each parameter (Back-propagation)

Backpropagation

Backpropagation is an efficient algorithm that helps you calculate the derivative of the cost function with respect to each parameter of the neural network. The name comes from the fact that we calculate the errors of all neurons starting from the output layer and moving towards the input layer. After those errors are calculated, we simply multiply them by the activations calculated during forward propagation.

Doing the backpropagation (Vectorized)

As mentioned, backpropagation flows in reverse order, iterating from the last layer. So, starting from the output layer, we calculate the output layer error.

The "error values" for the last layer are simply the differences of our actual results in the last layer and the correct outputs in y.

\delta^{(L)} = a^{(L)} - y

Where

  • y: Expected output from training

  • a^{(L)}: Network output/activation of the last layer L

For all other layers (from the layer before the last, back towards the input):

\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ g'(z^{(l)})

Where

  • \delta^{(l)}: Error of layer l

  • g'(x): Derivative of the activation function

  • z^{(l)}: Pre-activation of layer l

  • .*: Element-wise multiplication

After all the errors (deltas) are calculated, we need to calculate the actual derivatives of the loss, which are the product of each error and the activation of the respective neuron:

\dfrac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}} = \frac{1}{m}\sum_{t=1}^m a_j^{(t)(l)} \delta_i^{(t)(l+1)}

For now we're ignoring the regularisation term.
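
A sketch of these steps for a single training example (Matlab/Octave; sigmoid activations, bias folded into each \Theta as before; as{l} are the activations cached during forward propagation, with the bias prepended for every layer except the last, and zs{l} the pre-activations):

% Vectorized backpropagation for one example; returns one gradient per Theta.
function grads = backprop(Thetas, as, zs, y)
  L = numel(Thetas) + 1;                   % total number of layers
  sig = @(z) 1 ./ (1 + exp(-z));
  g_prime = @(z) sig(z) .* (1 - sig(z));   % derivative of the sigmoid
  delta = as{L} - y;                       % output error: delta^(L) = a^(L) - y
  for l = (L-1):-1:1
    grads{l} = delta * as{l}';             % dJ/dTheta^(l) = delta^(l+1) * a^(l)'
    if l > 1
      d = Thetas{l}' * delta;              % propagate the error backwards
      delta = d(2:end) .* g_prime(zs{l});  % drop the bias row, apply g'(z^(l))
    end
  end
end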

Complete algorithm

Below we describe the whole procedure in pseudo-code.
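
A sketch of the whole procedure in Matlab/Octave style (init_weights and forward_all are illustrative helpers, not a library API; forward_all stands for a variant of forward_propagation above that also returns the cached activations as and pre-activations zs):

Thetas = init_weights();                       % small random values (see below)
for epoch = 1:num_epochs
  for t = 1:m                                  % for each training example
    [as, zs] = forward_all(Thetas, X(:, t));   % forward pass, caching as/zs
    grads = backprop(Thetas, as, zs, Y(:, t)); % backward pass (see above)
    for l = 1:numel(Thetas)
      Thetas{l} = Thetas{l} - alpha * grads{l};  % gradient descent update
    end
  end
end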

Gradient Checking

In order to verify that your backpropagation code is right, we can estimate the gradient using another algorithm. Unfortunately we cannot use this algorithm in practice because it makes training slow, but we can use it to compare its results against backpropagation. Basically, we calculate the derivative of the loss with respect to each parameter numerically, by computing the loss after adding a small perturbation (e.g. 10^{-4}) to each parameter, one at a time.

\dfrac{\partial}{\partial\Theta}J(\Theta) \approx \dfrac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}

You then compare this numerical gradient with the output of backpropagation: \dfrac{\partial}{\partial\Theta}J(\Theta) \approx \Delta_{\text{Backprop}}.

Again, this is really slow because we need to recompute the loss twice for each parameter (once with the perturbation added and once with it subtracted).

For example, suppose that you have 3 parameters \Theta = \{\theta_1, \theta_2, \theta_3\}:

\dfrac{\partial}{\partial\theta_1}J(\Theta) \approx \dfrac{J(\theta_1 + \epsilon, \theta_2, \theta_3) - J(\theta_1 - \epsilon, \theta_2, \theta_3)}{2\epsilon}

\dfrac{\partial}{\partial\theta_2}J(\Theta) \approx \dfrac{J(\theta_1, \theta_2 + \epsilon, \theta_3) - J(\theta_1, \theta_2 - \epsilon, \theta_3)}{2\epsilon}

\dfrac{\partial}{\partial\theta_3}J(\Theta) \approx \dfrac{J(\theta_1, \theta_2, \theta_3 + \epsilon) - J(\theta_1, \theta_2, \theta_3 - \epsilon)}{2\epsilon}
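
A minimal numerical gradient-checking sketch (Matlab/Octave; J is assumed to be a function handle that returns the cost for an unrolled parameter vector theta):

% Central-difference estimate of the gradient, one parameter at a time.
function numgrad = gradient_check(J, theta)
  epsilon = 1e-4;                 % the small perturbation
  numgrad = zeros(size(theta));
  for i = 1:numel(theta)
    perturb = zeros(size(theta));
    perturb(i) = epsilon;
    numgrad(i) = (J(theta + perturb) - J(theta - perturb)) / (2 * epsilon);
  end
end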

Weights initialization

The way you initialize your network parameters is also important. You cannot, for instance, initialize all your weights to zero; normally you want to initialize them with small random values in the range [-\epsilon_{init}, \epsilon_{init}]. The point of the random values is to break symmetry: if the weights started out equal, all neurons in a layer would compute the same output and receive the same update.

INIT_EPSILON = 0.12;  % half-width of the init range (illustrative value)
Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;  % layer 1 -> layer 2
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;  % layer 2 -> layer 3
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;   % layer 3 -> output

What we can see from the code above is that we create the random numbers independently for each layer, all of them within the range [-\epsilon_{init}, \epsilon_{init}] (INIT_EPSILON in the code).

Training steps

Now let's list the steps required to train a neural network. Here we use the term epoch, which means a complete pass through all the elements of your training set. You repeat your training set over and over because the weights don't completely learn a concept in a single epoch.

  1. Initialize weights randomly

  2. For each epoch:

  3. Do the forward propagation

  4. Calculate the loss

  5. Do the backward propagation

  6. Update weights with gradient descent (optionally use gradient checking to verify backpropagation)

  7. Go to step 2 until you finish all epochs

Training/Validation/Test data

Some good practices for building the dataset used to train your hypothesis models:

  1. Collect as much data as possible

  2. Merge/Shuffle all this data

  3. Divide this dataset into train(60%)/validation(20%)/test(20%) set

  4. Avoid having a test set drawn from a different distribution than your train/validation sets

  5. Use the validation set to tune your model (number of layers, number of neurons)

  6. Check overall performance with the test set

When we say to use the validation set, it means that we change parameters of our model and check which configuration gets better results on this validation set; don't tune on your training set. If you are getting bad results on your test set, consider getting more data, and verify that your train/validation/test sets come from the same distribution.
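
A small sketch of the shuffle-and-split step (Matlab/Octave; X is assumed to hold one example per column, and the 60/20/20 proportions follow the list above):

m = size(X, 2);              % number of examples
idx = randperm(m);           % shuffle the examples
n_train = round(0.6 * m);
n_val = round(0.2 * m);
X_train = X(:, idx(1:n_train));
X_val   = X(:, idx(n_train+1:n_train+n_val));
X_test  = X(:, idx(n_train+n_val+1:end));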

Effects of deep neural networks

As mentioned earlier, deeper and bigger neural networks generally perform better at recognition, but some problems also arise with more complex models:

  1. Deeper and more complex neural networks need more data to train (roughly 10x the number of parameters)

  2. Overfitting can become a problem, so use regularization (dropout, L2 regularization)

  3. Prediction time will increase.

Neural networks as computation graphs

In order to calculate backpropagation, it's easier if you start representing your hypothesis as a computation graph. In the next chapters we will also use different types of layers working together, so to simplify development, consider neural networks as computation graphs. The idea is that if you provide the forward/backward implementation for each node of your graph, backpropagation becomes much easier.
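
As a tiny illustrative sketch (Matlab/Octave), a multiply node of a computation graph only needs its local forward and backward rules, and the chain rule does the rest:

% Forward: out = a * b. Backward: given dJ/dout, return dJ/da and dJ/db.
mul_forward  = @(a, b) a * b;
mul_backward = @(a, b, dout) deal(dout * b, dout * a);

out = mul_forward(3, 4);             % forward pass: out = 12
[da, db] = mul_backward(3, 4, 1.0);  % backward pass: da = 4, db = 3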
