Deep learning is a branch of machine learning based on a set of algorithms that learn to represent the data. Below we list the most popular ones.

Convolutional Neural Networks

Deep Belief Networks

Deep Auto-Encoders

Recurrent Neural Networks (RNN/LSTM/GRU)

Generative Adversarial Networks (GAN)

One of the promises of deep learning is that it will replace hand-crafted feature extraction. The idea is that the model will "learn" the best features needed to represent the given data.

Deep learning models are formed by multiple layers. In the context of artificial neural networks, a multilayer perceptron (MLP) with more than 2 hidden layers is already considered a deep model. As a rule of thumb, deeper models have the potential to perform better than shallow models. The problem is that the deeper you go, the more data you will need to avoid over-fitting.
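To make the idea of "more than 2 hidden layers" concrete, here is a minimal NumPy sketch of a forward pass through an MLP with three hidden layers. The helper names (`init_mlp`, `mlp_forward`), the layer sizes, the ReLU non-linearity and the weight initialization are all assumptions chosen for illustration, not something prescribed by this chapter.

```python
import numpy as np


def relu(x):
    """Element-wise ReLU non-linearity."""
    return np.maximum(0.0, x)


def init_mlp(layer_sizes, seed=0):
    """Create one (weights, bias) pair per pair of consecutive layer sizes."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Scaled random weights (an illustrative choice) and zero biases.
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
        b = np.zeros(n_out)
        params.append((W, b))
    return params


def mlp_forward(x, params):
    """Apply each hidden layer followed by ReLU, then a linear output layer."""
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)
    W_out, b_out = params[-1]
    return h @ W_out + b_out


# Hypothetical sizes: 4 inputs, three hidden layers of 16 units, 3 outputs.
layer_sizes = [4, 16, 16, 16, 3]
params = init_mlp(layer_sizes)
batch = np.ones((5, 4))            # a batch of 5 dummy examples
scores = mlp_forward(batch, params)
print(scores.shape)                # (5, 3)
```

Making the model deeper in this sketch only means adding entries to `layer_sizes`, which also adds parameters that need data to be fitted.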

Here we list some of the most used layers:

1. Convolution Layer
2. Max/Average Pooling Layer
3. Dropout Layer
4. Batch Normalization Layer
5. Fully Connected (Affine) Layer
6. ReLU, Tanh, Sigmoid Layer (Non-Linearity Layers)
7. Softmax, Cross Entropy, SVM, Euclidean (Loss Layers)

Besides getting more data, there are some techniques used to combat over-fitting. Here is a list of the most common ones:

Dropout

L2 Regularization

Data Augmentation

Dropout is a technique that randomly turns off some neurons in the fully connected layers during training.

Dropout forces the fully connected layers to learn the same concept in different ways.
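The dropout idea described above can be sketched in a few lines of NumPy. This is a minimal sketch of the "inverted dropout" variant, which is an assumption on our part: the kept activations are scaled by `1 / keep_prob` during training so that nothing has to change at test time. The function names and the drop probability of 0.5 are illustrative only.

```python
import numpy as np


def dropout_forward(x, drop_prob, training, rng):
    """Inverted dropout: zero each activation with probability drop_prob."""
    if not training or drop_prob == 0.0:
        # At test time every neuron is kept and nothing is rescaled.
        return x, None
    keep_prob = 1.0 - drop_prob
    # Boolean keep-mask, rescaled so the expected activation is unchanged.
    mask = (rng.random(x.shape) < keep_prob) / keep_prob
    return x * mask, mask


def dropout_backward(grad_out, mask):
    """Gradients only flow through the neurons that were kept."""
    if mask is None:
        return grad_out
    return grad_out * mask


rng = np.random.default_rng(0)
activations = np.ones((4, 1000))   # dummy fully connected layer output

dropped, mask = dropout_forward(activations, 0.5, True, rng)
same, _ = dropout_forward(activations, 0.5, False, rng)

print(np.unique(dropped))          # only 0.0 (dropped) and 2.0 (kept, rescaled)
print(dropped.mean())              # close to 1.0, the original mean
```

Because a different random mask is drawn on every training step, no single neuron can be relied upon, which is what pushes the layer towards redundant representations.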

The most common form of regularization is L2 regularization. In this case we add a term to our loss function that penalizes the squared value of all the weights/parameters that we are optimizing. For each weight $w$ in our neural network we add the term $0.5 \lambda w^2$ to the loss/objective function.

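$\lambda$ is the regularization strength parameter. The factor of one half is used just to make things easier when we calculate the derivative for back propagation: the derivative of $0.5 \lambda w^2$ with respect to $w$ is simply $\lambda w$, because the 2 from the exponent cancels the half.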
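As a small illustration of the penalty term and its derivative, the sketch below computes $0.5 \lambda w^2$ summed over a weight matrix and compares the analytic gradient $\lambda w$ against a numerical estimate. The function names, the value of `lam` and the example weights are assumptions made for this example.

```python
import numpy as np


def l2_penalty(weights, lam):
    """Sum of 0.5 * lam * w^2 over every weight in every array."""
    return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)


def l2_gradient(W, lam):
    """Derivative of the penalty with respect to W: the 0.5 and the 2 cancel."""
    return lam * W


lam = 1e-2                                   # hypothetical regularization strength
W = np.array([[1.0, -2.0], [0.5, 3.0]])      # hypothetical weights

penalty = l2_penalty([W], lam)
grad = l2_gradient(W, lam)

# Central-difference check of the gradient for the single weight W[0, 1].
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (l2_penalty([W_plus], lam) - l2_penalty([W_minus], lam)) / (2 * eps)

print(penalty)                # 0.5 * 0.01 * 14.25 = 0.07125
print(grad[0, 1], numeric)    # both approximately -0.02
```

In training, `penalty` would be added to the data loss and `grad` added to the gradient of each weight matrix before the update step.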

As a result of using this regularization, weights with very large values are penalized heavily. This encourages the model to use all the inputs to a layer a little rather than a few inputs a lot. This property is intuitively appealing, as the capacity of the model is used more fully and fewer weights go unused.

Besides L2 regularization there are also L1 regularization and Max Norm, but these aren't discussed here as L2 generally performs better.

It is possible to synthetically create new training examples by applying transformations to the input data, for example flipping images or randomly shifting RGB values. During the 2012 ImageNet competition, Alex Krizhevsky (AlexNet) used data augmentation by a factor of 2048, which meant that the dataset used to train his model was effectively 2048 times larger than at the start. This improved generalization compared with not using it.
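The two transformations mentioned above, flipping and RGB shifting, can be sketched with NumPy as follows. This is only an illustrative sketch: the function names, the flip probability, the maximum shift of 20 intensity levels and the random test image are assumptions, and it is not the exact augmentation scheme used for AlexNet.

```python
import numpy as np


def random_horizontal_flip(image, rng, prob=0.5):
    """Mirror an (H, W, 3) image left-to-right with probability prob."""
    if rng.random() < prob:
        return image[:, ::-1, :]
    return image


def random_rgb_shift(image, rng, max_shift=20):
    """Add one random integer offset per colour channel, clipped to [0, 255]."""
    shift = rng.integers(-max_shift, max_shift + 1, size=3)
    # Work in a wider integer type so the addition cannot wrap around.
    shifted = image.astype(np.int16) + shift
    return np.clip(shifted, 0, 255).astype(np.uint8)


def augment(image, rng):
    """Produce one new training example from an existing image."""
    return random_rgb_shift(random_horizontal_flip(image, rng), rng)


rng = np.random.default_rng(0)
# A random 32x32 RGB array standing in for a real training image.
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

augmented = augment(image, rng)
print(augmented.shape, augmented.dtype)   # (32, 32, 3) uint8
```

Calling `augment` on the same image at every epoch yields a slightly different example each time, while the label stays the same.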

The idea is to let the learning algorithm find the best representation that it can for every layer, from the inputs to the deepest layers. The shallow layers learn to represent the data in its simplest form, and the deeper layers learn to represent the data using the concepts learned by the previous ones.

Actually, the only new thing is the use of models that learn how to best represent your data (feature extraction) automatically, based on the dataset given to them. This is different from the past, when hand-crafted features such as HOG (Histogram of Oriented Gradients) were used. Coming up with these features could take a long time, and they were not guaranteed to be optimal for every dataset.

The biggest advantage of this approach is that if your problem gets more complex, you can make your model "deeper" and gather more data (a lot of it) to train on your new problem, and the model will then learn the best features for your particular task.

In the next chapter we will learn about Convolutional Neural Networks.