Recurrent Neural Networks
In the feed-forward neural networks seen previously, the output was a function of the current input and a set of weights. In recurrent neural networks (RNNs), the previous network state also influences the output, so RNNs have a "notion of time". This effect is achieved by a loop that feeds the layer output back into its input.
In other words, the RNN will be a function $f_W$ with inputs $x_t$ (input vector) and previous state $h_{t-1}$. The new state will be $h_t = f_W(h_{t-1}, x_t)$. The recurrent function $f_W$ will be fixed after training and applied at every time-step.
Recurrent neural networks are well suited to regression on sequential data (for example time series), because they take past values into account.
RNNs are computationally Turing-complete: with the right set of weights they can compute anything, and you can imagine those weights as a program. But before you get too confident about RNNs, note that there is no automatic back-propagation algorithm that will find this "perfect" set of weights. Some example applications:
Machine translation (English --> French)
Speech to text
Market prediction
Scene labelling (combined with CNN)
Car wheel steering (combined with CNN)
Below we have a simple implementation of the RNN recurrent function (vanilla version): $h_t = \tanh(W_{hh}\,h_{t-1} + W_{xh}\,x_t + b_h)$, with output $y_t = W_{hy}\,h_t$.
The code that calculates the next state for one time-step looks like this:
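Since the original snippet is not reproduced here, the following is a minimal NumPy sketch; the function name `rnn_step_forward`, the column-vector shapes, and the returned cache layout are assumptions of this sketch rather than the book's exact code.

```python
import numpy as np

def rnn_step_forward(x, prev_h, Wx, Wh, b):
    """One vanilla RNN time-step: h_t = tanh(Wx.x_t + Wh.h_{t-1} + b).

    Assumed shapes (column-vector convention):
      x: (n_input, 1), prev_h: (n_hidden, 1)
      Wx: (n_hidden, n_input), Wh: (n_hidden, n_hidden), b: (n_hidden, 1)
    """
    next_h = np.tanh(Wx @ x + Wh @ prev_h + b)
    # Keep everything needed for the backward pass
    cache = (x, prev_h, Wx, Wh, next_h)
    return next_h, cache
```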
Before we start, let's make explicit how to backpropagate through the tanh block: since $\frac{d}{dz}\tanh(z) = 1 - \tanh^2(z)$, the gradient arriving at the tanh output is simply multiplied by $(1 - h_t^2)$.
Now we can do the backpropagation step (for one single time-step):
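Again as a sketch that mirrors the `rnn_step_forward` function above (the name `rnn_step_backward` and the cache layout are assumptions):

```python
def rnn_step_backward(dnext_h, cache):
    """Backward pass for one vanilla RNN time-step."""
    x, prev_h, Wx, Wh, next_h = cache
    # Backprop through tanh: dtanh/dz = 1 - tanh(z)^2 = 1 - next_h^2
    dz = dnext_h * (1.0 - next_h ** 2)
    dx = Wx.T @ dz            # gradient w.r.t. the input
    dprev_h = Wh.T @ dz       # gradient flowing to the previous state
    dWx = dz @ x.T            # gradient w.r.t. input-to-hidden weights
    dWh = dz @ prev_h.T       # gradient w.r.t. hidden-to-hidden weights
    db = dz.sum(axis=1, keepdims=True)
    return dx, dprev_h, dWx, dWh, db
```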
This looping feature of RNNs can be confusing at first, but you can think of the network as a normal neural network repeated (unrolled) multiple times. The number of times you unroll can be thought of as how far into the past the network will remember. In other words, each repetition is a time-step.
In the previous examples we presented code for forward and back propagation for one time-step only. As described before, the RNN is unrolled over a (finite) number of time-steps. Now we present how to do the forward propagation over all time-steps.
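A minimal sketch of the unrolled forward pass, reusing the assumed `rnn_step_forward` helper; the list-of-column-vectors input layout is also an assumption:

```python
def rnn_forward(x_seq, h0, Wx, Wh, b):
    """Run the RNN over a whole sequence, one time-step at a time.

    x_seq: list of (n_input, 1) vectors, one per time-step.
    Returns all hidden states plus the caches needed for backprop through time.
    """
    h, hidden_states, caches = h0, [], []
    for x in x_seq:
        h, cache = rnn_step_forward(x, h, Wx, Wh, b)
        hidden_states.append(h)
        caches.append(cache)
    return hidden_states, caches
```

The backward pass through time simply walks these caches in reverse order, feeding the `dprev_h` of one step into the step before it and summing the weight gradients.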
Below we show a diagram that presents the multiple ways you could use a recurrent neural network, compared to feed-forward networks. Consider the red blocks as inputs and the blue blocks as outputs.
One to one: normal feed-forward network, e.g. image on the input, label on the output
One to many (RNN): (Image captioning) image in, words describing the scene out (CNN detected regions + RNN)
Many to one (RNN): (Sentiment analysis) words of a phrase on the input, sentiment of the product (good/bad) on the output
Many to many (RNN): (Translation) English phrase on the input, Portuguese on the output
Many to many (RNN): (Video classification) video in, description of the video on the output
Below we describe how to add "depth" to RNNs and also how to unroll RNNs to deal with time. Observe that the outputs of the RNN are fed to the deeper layers, while the state is fed forward in time to deal with past states.
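A small sketch of this stacking idea, reusing the assumed `rnn_step_forward` helper; the `params` list of per-layer weight tuples is an assumption of the sketch:

```python
def stacked_rnn_forward(x_seq, h0_list, params):
    """Depth-stacked RNN: at each time-step the output of one layer becomes the
    input of the layer above, while every layer carries its own state in time.

    params: list of (Wx, Wh, b) tuples, one per layer.
    h0_list: list of initial states, one per layer.
    """
    h_states = list(h0_list)
    top_outputs = []
    for x in x_seq:
        layer_input = x
        for layer, (Wx, Wh, b) in enumerate(params):
            h_states[layer], _ = rnn_step_forward(layer_input, h_states[layer], Wx, Wh, b)
            layer_input = h_states[layer]     # feed this layer's output upward
        top_outputs.append(layer_input)       # top-layer output at this time-step
    return top_outputs, h_states
```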
Here we present a simple case where we want the RNN to complete a word. We give the network the characters h, e, l, l; our vocabulary here is [h, e, l, o]. Observe that after we input the first 'h' the network wants to output the wrong answer (the right one is in green), but near the end, after the second 'l', it wants to output the right answer 'o'. Here the order in which the characters arrive does matter.
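A toy sketch of this character setup, reusing the assumed `rnn_step_forward` helper from above (untrained, so the printed predictions are meaningless); the one-hot encoding, weight shapes, and readout matrix `Why` are assumptions for illustration:

```python
import numpy as np

np.random.seed(0)
vocab = ['h', 'e', 'l', 'o']
char_to_idx = {c: i for i, c in enumerate(vocab)}

def one_hot(c):
    v = np.zeros((len(vocab), 1))
    v[char_to_idx[c]] = 1.0
    return v

n_hidden = 8
Wx = np.random.randn(n_hidden, len(vocab)) * 0.01   # char (one-hot) -> hidden
Wh = np.random.randn(n_hidden, n_hidden) * 0.01     # hidden -> hidden
Why = np.random.randn(len(vocab), n_hidden) * 0.01  # hidden -> char scores
b = np.zeros((n_hidden, 1))

h = np.zeros((n_hidden, 1))
for c in "hell":
    h, _ = rnn_step_forward(one_hot(c), h, Wx, Wh, b)
    scores = Why @ h                                 # score per vocabulary entry
    print(c, '->', vocab[int(np.argmax(scores))])    # predicted next character
```

After training, the desired outputs for the inputs h, e, l, l would be e, l, l, o.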
If you connect a convolutional neural network to an RNN, the combined model will be able to describe what it "sees" in the image.
Basically we take a pre-trained CNN (e.g. VGG), take its second-to-last FC layer, and connect it to an RNN. After this you train the whole thing end-to-end.
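A rough, untrained sketch of this captioning pipeline, reusing the assumed `rnn_step_forward` helper; the tiny vocabulary, the feature size, the matrix `W_img` that maps CNN features to the initial state, and the greedy decoding loop are all assumptions for illustration, not the actual captioning model:

```python
import numpy as np

np.random.seed(0)
words = ['<START>', '<END>', 'a', 'dog', 'on', 'grass']
n_hidden, n_feat, n_words = 16, 512, len(words)

W_img = np.random.randn(n_hidden, n_feat) * 0.01    # CNN features -> initial RNN state
Wx = np.random.randn(n_hidden, n_words) * 0.01      # word (one-hot) -> hidden
Wh = np.random.randn(n_hidden, n_hidden) * 0.01     # hidden -> hidden
Why = np.random.randn(n_words, n_hidden) * 0.01     # hidden -> word scores
b = np.zeros((n_hidden, 1))

cnn_features = np.random.randn(n_feat, 1)           # stand-in for the VGG FC-layer output
h = np.tanh(W_img @ cnn_features)                   # the image conditions the RNN state

idx, caption = words.index('<START>'), []
for _ in range(10):                                 # generate at most 10 words
    x = np.zeros((n_words, 1)); x[idx] = 1.0        # one-hot of the previous word
    h, _ = rnn_step_forward(x, h, Wx, Wh, b)
    idx = int(np.argmax(Why @ h))                   # greedy next-word choice
    if words[idx] == '<END>':
        break
    caption.append(words[idx])
print(' '.join(caption))
```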
In other words, LSTMs suffer much less from vanishing gradients than normal RNNs. Remember that the plus gates distribute the gradients unchanged during backpropagation.
So, by suffering less from vanishing gradients, LSTMs can remember much further into the past. From now on, just use LSTMs when you think about RNNs.
Put another way, LSTMs are better at remembering long-term dependencies.
Observe in the animation below how fast the gradients of the RNN vanish compared to the LSTM.
The vanishing gradient problem can be mitigated with LSTMs, but another problem that can happen with any recurrent neural network is the exploding gradient problem.
To fix the exploding gradient problem, people normally apply gradient clipping, which allows only a maximum gradient value (or norm).
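A minimal sketch of norm-based gradient clipping; the helper name `clip_gradients` and the threshold of 5.0 are arbitrary choices for illustration:

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm never exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads
```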
This highway for the gradients is called the cell state. So one difference compared to the vanilla RNN, which has only the hidden state flowing through time, is that in the LSTM we have both the hidden state and the cell state.
Zooming in on the LSTM gates. This view also makes it clearer how to do the backpropagation.
Code for the LSTM forward propagation for one time-step:
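Since the original snippet is not reproduced here, below is a NumPy sketch of one LSTM step; the function name `lstm_step_forward`, the stacked 4H-row weight layout, and the gate ordering (input, forget, output, candidate) are assumptions of this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
    """One LSTM time-step.

    Assumed shapes: x (n_input, 1), prev_h / prev_c (n_hidden, 1),
    Wx (4*n_hidden, n_input), Wh (4*n_hidden, n_hidden), b (4*n_hidden, 1).
    """
    H = prev_h.shape[0]
    a = Wx @ x + Wh @ prev_h + b            # pre-activations for all four gates
    i = sigmoid(a[0:H])                     # input gate
    f = sigmoid(a[H:2*H])                   # forget gate
    o = sigmoid(a[2*H:3*H])                 # output gate
    g = np.tanh(a[3*H:4*H])                 # candidate cell values
    next_c = f * prev_c + i * g             # cell state: the additive "highway"
    next_h = o * np.tanh(next_c)            # hidden state
    cache = (x, prev_h, prev_c, Wx, Wh, i, f, o, g, next_c)
    return next_h, next_c, cache
```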
Now the backward propagation for one time-step:
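A matching backward sketch for the `lstm_step_forward` function above (same caveats about assumed names and layout):

```python
def lstm_step_backward(dnext_h, dnext_c, cache):
    """Backward pass for one LSTM time-step, mirroring lstm_step_forward."""
    x, prev_h, prev_c, Wx, Wh, i, f, o, g, next_c = cache
    tanh_c = np.tanh(next_c)

    # Gradient reaching the cell state: directly from next_c and through next_h
    dc = dnext_c + dnext_h * o * (1.0 - tanh_c ** 2)

    do = dnext_h * tanh_c                   # output gate
    di = dc * g                             # input gate
    dg = dc * i                             # candidate values
    df = dc * prev_c                        # forget gate
    dprev_c = dc * f                        # gradient flowing to the previous cell state

    # Through the gate non-linearities (sigmoid' and tanh'), same order as the forward pass
    da = np.concatenate([di * i * (1.0 - i),
                         df * f * (1.0 - f),
                         do * o * (1.0 - o),
                         dg * (1.0 - g ** 2)], axis=0)

    dx = Wx.T @ da
    dprev_h = Wh.T @ da
    dWx = da @ x.T
    dWh = da @ prev_h.T
    db = da.sum(axis=1, keepdims=True)
    return dx, dprev_h, dprev_c, dWx, dWh, db
```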
The GRU cell can be considered a variant of the LSTM cell (it also fights vanishing gradients), but it is more computationally efficient. In this cell the forget and input gates are merged into a single update gate.
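A forward-pass sketch of a GRU step; note that GRU conventions vary slightly between papers, and the stacked 3H-row weight layout here is an assumption (the `sigmoid` helper is the one defined in the LSTM sketch above):

```python
def gru_step_forward(x, prev_h, Wx, Wh, b):
    """One GRU time-step.

    Assumed shapes: Wx (3*n_hidden, n_input), Wh (3*n_hidden, n_hidden), b (3*n_hidden, 1).
    """
    H = prev_h.shape[0]
    ax = Wx @ x + b
    ah = Wh @ prev_h
    z = sigmoid(ax[0:H] + ah[0:H])                      # update gate (merged forget/input)
    r = sigmoid(ax[H:2*H] + ah[H:2*H])                  # reset gate
    h_tilde = np.tanh(ax[2*H:3*H] + r * ah[2*H:3*H])    # candidate state
    next_h = (1.0 - z) * prev_h + z * h_tilde           # interpolate old and candidate state
    return next_h
```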
Observe that in the case of the RNN we are now more interested in the next state $h_t$, not exactly the output $y_t$.
A point to be noted is that the same function $f_W$ and the same set of parameters will be applied at every time-step.
A good initialization for the RNN state is zero. Again, this is just the initial RNN state, not its weights.
LSTM provides a different recurrent formula $f_W$; it is more powerful than the vanilla RNN due to its more complex $f_W$, which adds "residual information" to the next state instead of just transforming each state. You can imagine LSTMs as the "residual" version of RNNs.
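For reference, a standard formulation of the LSTM update (consistent with the sketch code earlier on this page; exact gate notation varies between texts). The additive update of the cell state $c_t$ is what gives the "residual" behaviour:

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
g_t &= \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$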