# Recurrent Neural Networks

## Introduction

In the feedforward neural networks seen previously, the output was a function of the current input and a set of weights. In recurrent neural networks (RNNs), the previous network state also influences the output, so recurrent neural networks have a "notion of time". This effect is achieved by a loop from the layer's output back to its input.

Recurrent neural networks are a good fit for regression on sequential data, because they take past values into account.

RNNs are "Turing machines" in the computational sense: with the correct set of weights they can compute anything, so think of the weights as a program. Just so you don't get overconfident about RNNs: there is no automatic backpropagation algorithm that is guaranteed to find this "perfect set of weights".

#### Use cases of recurrent neural networks

Machine translation (English --> French)

Speech to text

Market prediction

Scene labelling (Combined with CNN)

Car wheel steering (Combined with CNN)

### Implementing a vanilla RNN in Python

Below we have the recurrence function of a simple RNN (vanilla version):

$\Large h_t=f_{weights}(h_{t-1},x_t) \therefore \\ \Large h_t=\tanh(W_{hh}.h_{t-1} + W_{xh}.x_t) \\ \Large y_t = W_{hy}.h_t$

The forward pass for one time-step only needs to compute up to the next state $h_t$.

Observe that in the case of an RNN we are more interested in the next state $h_t$ than in the output $y_t$.
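
A minimal NumPy sketch of this forward step, following the formula above term by term. The function names, the cache tuple and the vector/matrix shapes are assumptions of this sketch, not something fixed by the text.

```python
import numpy as np

def rnn_step_forward(x, prev_h, Wxh, Whh):
    """One vanilla RNN time-step: h_t = tanh(W_hh . h_{t-1} + W_xh . x_t).

    Assumed shapes: x (D,), prev_h (H,), Wxh (H, D), Whh (H, H).
    """
    next_h = np.tanh(Whh @ prev_h + Wxh @ x)
    # Keep everything the backward pass will need.
    cache = (x, prev_h, Wxh, Whh, next_h)
    return next_h, cache

def rnn_output(h, Why):
    """Readout y_t = W_hy . h_t, with Why of assumed shape (O, H)."""
    return Why @ h
```

The readout is kept as a separate function because, as noted above, the state $h_t$ is what gets carried to the next time-step.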

Before we start, let's make explicit how to backpropagate through the tanh block.
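
Since $\frac{d}{dz}\tanh(z) = 1 - \tanh^2(z)$, the backward pass only needs the output of the forward pass. A small sketch (the function names and the choice of caching the output are assumptions of this example):

```python
import numpy as np

def tanh_forward(z):
    out = np.tanh(z)
    # The output itself is all the backward pass needs, so it is the cache.
    return out, out

def tanh_backward(dout, cache):
    out = cache
    # Chain rule: upstream gradient times the local derivative 1 - tanh(z)^2.
    return dout * (1.0 - out ** 2)
```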

Now we can do the backpropagation step (for a single time-step).
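
A possible sketch of that backward step, for the bias-free formula $h_t=\tanh(W_{hh}.h_{t-1} + W_{xh}.x_t)$. The forward step is repeated so the block runs on its own; the names and the cache layout are assumptions of this sketch.

```python
import numpy as np

def rnn_step_forward(x, prev_h, Wxh, Whh):
    next_h = np.tanh(Whh @ prev_h + Wxh @ x)
    return next_h, (x, prev_h, Wxh, Whh, next_h)

def rnn_step_backward(dnext_h, cache):
    """Gradients of the loss w.r.t. the inputs of one time-step.

    dnext_h is the upstream gradient dL/dh_t, shape (H,).
    """
    x, prev_h, Wxh, Whh, next_h = cache
    # Through the tanh block: local derivative is 1 - tanh(z)^2.
    dz = dnext_h * (1.0 - next_h ** 2)
    # z = Whh @ prev_h + Wxh @ x, so each term gets dz times its partner.
    dx = Wxh.T @ dz
    dprev_h = Whh.T @ dz
    dWxh = np.outer(dz, x)
    dWhh = np.outer(dz, prev_h)
    return dx, dprev_h, dWxh, dWhh
```

Note that `dprev_h` is what flows back to the previous time-step, while `dWxh` and `dWhh` have to be summed over all time-steps because the weights are shared.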

A point to be noted is that the same function $f_{weights}$ and the same set of parameters will be applied to every time-step.

A good initialization for the RNN state $h_t$ is zero. Again, this is just the initial RNN state, not its weights.

This looping feature of RNNs can be confusing at first, but you can think of it as a normal neural network repeated (unrolled) multiple times. The number of times you unroll can be considered how far into the past the network will remember. In other words, each repetition is a time-step.

### Forward and backward propagation on each time-step

The previous examples presented code for forward and backward propagation for one time-step only. As presented before, the RNN is unrolled over a finite number of time-steps. Now we present how to do the forward propagation over every time-step.
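
A minimal sketch of the unrolled forward pass, assuming the whole sequence is available as an array `xs` of shape `(T, D)` and using the vanilla recurrence from earlier. The function name and shapes are assumptions of this example.

```python
import numpy as np

def rnn_forward(xs, h0, Wxh, Whh):
    """Run the vanilla RNN over T time-steps.

    xs: (T, D) inputs, h0: (H,) initial state. Returns hs: (T, H).
    """
    T = xs.shape[0]
    hs = np.zeros((T, h0.shape[0]))
    prev_h = h0
    for t in range(T):
        # Same function and same weights at every time-step.
        prev_h = np.tanh(Whh @ prev_h + Wxh @ xs[t])
        hs[t] = prev_h
    return hs
```

Calling it with `h0 = np.zeros(H)` matches the zero initial state suggested above.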

Below we show a diagram that presents the multiple ways you could use a recurrent neural network, compared with feedforward networks. The inputs are the red blocks and the outputs are the blue blocks.

One to one: Normal feedforward network, e.g. image on the input, label on the output.

One to many (RNN): (Image captioning) Image in, words describing the scene out (CNN regions detected + RNN).

Many to one (RNN): (Sentiment analysis) Words of a phrase on the input, sentiment on the output (good/bad product).

Many to many (RNN): (Translation) Words of an English phrase on the input, Portuguese on the output.

Many to many (RNN): (Video classification) Video in, description of the video on the output.

### Stacking RNNs

Below we describe how to add "depth" to an RNN and also how to unroll RNNs to deal with time. Observe that the output of each RNN is fed to the deeper layers, while the state is fed forward in time to deal with past states.
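
One way to sketch this stacking with the vanilla recurrence: each layer runs over the whole sequence, and its sequence of states becomes the input sequence of the next (deeper) layer. Treating a layer's state as its output, the zero initial states, and the `layers` list of `(Wxh, Whh)` pairs are all assumptions of this example.

```python
import numpy as np

def stacked_rnn_forward(xs, layers):
    """xs: (T, D). layers: list of (Wxh, Whh) pairs, shallowest first.

    Returns the states of the deepest layer, shape (T, H_top).
    """
    inputs = xs
    for Wxh, Whh in layers:
        T, H = inputs.shape[0], Whh.shape[0]
        h = np.zeros(H)  # each layer keeps its own state through time
        outputs = np.zeros((T, H))
        for t in range(T):
            h = np.tanh(Whh @ h + Wxh @ inputs[t])
            outputs[t] = h
        # Depth: this layer's outputs feed the next layer.
        inputs = outputs
    return inputs
```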

### Simple regression example

Here we present a simple case where we want the RNN to complete a word. We give the network the characters h, e, l, l, and our vocabulary here is [h, e, l, o]. Observe that after we input the first 'h' the network wants to output the wrong answer (the right one is in green), but near the end, after the second 'l', it wants to output the right answer 'o'. Here the order in which the characters come in does matter.
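
The setup can be sketched as follows: one-hot encode each character, feed "hell" one character at a time, and score the next character with a softmax over the vocabulary. The hidden size, the random untrained weights and the cross-entropy loss are assumptions of this sketch, so the predictions it prints are arbitrary until the weights are trained.

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    v = np.zeros(len(vocab))
    v[char_to_ix[ch]] = 1.0
    return v

def softmax(scores):
    e = np.exp(scores - scores.max())  # shift for numerical stability
    return e / e.sum()

inputs, targets = "hell", "ello"  # each input char should predict the next one

rng = np.random.default_rng(0)
H = 3  # hidden size, chosen arbitrarily for the example
Wxh = rng.normal(scale=0.1, size=(H, len(vocab)))
Whh = rng.normal(scale=0.1, size=(H, H))
Why = rng.normal(scale=0.1, size=(len(vocab), H))

h = np.zeros(H)
loss = 0.0
for ch, target in zip(inputs, targets):
    h = np.tanh(Whh @ h + Wxh @ one_hot(ch))
    probs = softmax(Why @ h)
    loss += -np.log(probs[char_to_ix[target]])  # cross-entropy for this step
    print(ch, '->', vocab[int(np.argmax(probs))], '(target: ' + target + ')')
```

The two 'l' inputs have identical one-hot vectors, so only the state `h` lets the network tell them apart, which is why the order matters.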

### Describing images

If you connect a pre-trained convolutional neural network to an RNN, the RNN will be able to describe what it "sees" in the image.

Basically we take a pre-trained CNN (e.g. VGG) and connect its second-to-last FC layer to an RNN. After this you train the whole thing end-to-end.

### Long Short-Term Memory networks (LSTM)

The LSTM provides a different recurrent formula $f_W$. It is more powerful than the vanilla RNN because its more complex $f_W$ adds "residual information" to the next state instead of just transforming each state. Think of LSTMs as the "residual" version of RNNs.

In other words, LSTMs suffer much less from vanishing gradients than normal RNNs. Remember that plus gates distribute the gradients.

Because they suffer less from vanishing gradients, LSTMs can remember much further into the past. So from now on, just use LSTMs when you think about RNNs.

Put another way, LSTMs are better at remembering long-term dependencies.
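
To get a feel for the vanishing gradient in a vanilla RNN, the sketch below backpropagates a gradient through 20 time-steps and records its norm at each step. The recurrent matrix (an orthogonal matrix scaled by 0.9) and the random stand-in for the input term are assumptions chosen only to make the shrinking easy to see.

```python
import numpy as np

def backprop_norms(Whh, hs, dh_last):
    """Push dh_last back through the stored states hs (T, H).

    Returns the gradient norm at each time-step, from last to first.
    """
    dh = dh_last
    norms = []
    for t in reversed(range(hs.shape[0])):
        norms.append(np.linalg.norm(dh))
        # h_t = tanh(Whh @ h_{t-1} + ...): tanh derivative, then Whh transposed.
        dh = Whh.T @ (dh * (1.0 - hs[t] ** 2))
    return norms

rng = np.random.default_rng(0)
H, T = 4, 20
Q, _ = np.linalg.qr(rng.normal(size=(H, H)))
Whh = 0.9 * Q  # assumption: every singular value equals 0.9

h = np.zeros(H)
hs = np.zeros((T, H))
for t in range(T):
    h = np.tanh(Whh @ h + rng.normal(size=H))  # random stand-in for Wxh @ x_t
    hs[t] = h

norms = backprop_norms(Whh, hs, np.ones(H))
print(norms[0], norms[-1])
```

Every step multiplies the gradient by the tanh derivative (at most 1) and by `Whh.T`, so the norm can only shrink here. Along the LSTM cell state, the corresponding step is an element-wise multiplication by the forget gate instead.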

Observe in the animation below how fast the gradients in the RNN disappear compared with the LSTM.

The vanishing gradient problem can be largely mitigated with LSTMs, but another problem that can happen with all recurrent neural networks is the exploding gradient problem.

To fix the exploding gradient problem, people normally use gradient clipping, which caps the gradient at a maximum value.
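
A minimal sketch of element-wise gradient clipping with NumPy. The function name and the threshold of 5.0 are assumptions for illustration; the threshold is a hyperparameter you would tune.

```python
import numpy as np

def clip_gradients(grads, max_value=5.0):
    """Clip every gradient entry to the range [-max_value, max_value]."""
    return [np.clip(g, -max_value, max_value) for g in grads]
```

It would be applied to the list of weight gradients (for example `dWxh` and `dWhh`) right before the parameter update.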

This highway for the gradients is called the cell state. So one difference from the vanilla RNN, which has only the state flowing, is that the LSTM has both the state and the cell state.

### LSTM Gate

Let's zoom in on the LSTM gate. This also makes it clearer how to do the backpropagation.

Code for LSTM forward propagation for one time-step
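
The text does not spell out the LSTM equations, so this sketch assumes the common formulation with input (`i`), forget (`f`) and output (`o`) gates plus a candidate (`g`), all computed from one stacked weight matrix. The names, the gate order inside the stacked matrices and the bias `b` are assumptions of this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
    """One LSTM time-step.

    Assumed shapes: x (D,), prev_h and prev_c (H,), Wx (4H, D), Wh (4H, H), b (4H,).
    """
    H = prev_h.shape[0]
    a = Wx @ x + Wh @ prev_h + b      # all four pre-activations at once
    i = sigmoid(a[0:H])               # input gate
    f = sigmoid(a[H:2 * H])           # forget gate
    o = sigmoid(a[2 * H:3 * H])       # output gate
    g = np.tanh(a[3 * H:4 * H])       # candidate values
    # The plus here is the "highway": the old cell state is added, not transformed.
    next_c = f * prev_c + i * g
    tanh_c = np.tanh(next_c)
    next_h = o * tanh_c
    cache = (x, prev_h, prev_c, Wx, Wh, i, f, o, g, tanh_c)
    return next_h, next_c, cache
```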

Now the backward propagation for one time-step
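
A sketch of the matching backward step, under the same assumed formulation (gates `i`, `f`, `o`, candidate `g`, stacked weights, bias `b`). The forward step is repeated so the block runs on its own. Note that the step receives two upstream gradients, one for the state and one for the cell state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
    H = prev_h.shape[0]
    a = Wx @ x + Wh @ prev_h + b
    i, f, o = sigmoid(a[0:H]), sigmoid(a[H:2 * H]), sigmoid(a[2 * H:3 * H])
    g = np.tanh(a[3 * H:4 * H])
    next_c = f * prev_c + i * g
    tanh_c = np.tanh(next_c)
    next_h = o * tanh_c
    return next_h, next_c, (x, prev_h, prev_c, Wx, Wh, i, f, o, g, tanh_c)

def lstm_step_backward(dnext_h, dnext_c, cache):
    x, prev_h, prev_c, Wx, Wh, i, f, o, g, tanh_c = cache
    # next_h = o * tanh(next_c)
    do = dnext_h * tanh_c
    dc = dnext_c + dnext_h * o * (1.0 - tanh_c ** 2)
    # next_c = f * prev_c + i * g (the plus gate passes dc to both terms)
    df = dc * prev_c
    dprev_c = dc * f
    di = dc * g
    dg = dc * i
    # Back through the sigmoid / tanh of each gate, in the stacked order i, f, o, g.
    da = np.concatenate([
        di * i * (1.0 - i),
        df * f * (1.0 - f),
        do * o * (1.0 - o),
        dg * (1.0 - g ** 2),
    ])
    # a = Wx @ x + Wh @ prev_h + b
    dx = Wx.T @ da
    dprev_h = Wh.T @ da
    dWx = np.outer(da, x)
    dWh = np.outer(da, prev_h)
    db = da
    return dx, dprev_h, dprev_c, dWx, dWh, db
```

The line `dprev_c = dc * f` is the gradient highway: the cell-state gradient is only scaled by the forget gate on its way to the previous time-step.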

### GRU (Gated Recurrent Unit) Cells

GRU cells can be considered a variant of the LSTM cell (they also aim to fight vanishing gradients), but more computationally efficient. In this cell the forget and input gates are merged into a single update gate.
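
A sketch of one GRU time-step in one common formulation, with an update gate `z` and a reset gate `r`. The weight names, the omission of biases and the convention `(1 - z) * prev_h + z * candidate` are assumptions of this example (some references swap the roles of `z` and `1 - z`).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step_forward(x, prev_h, Wz, Uz, Wr, Ur, Wc, Uc):
    """One GRU time-step.

    Assumed shapes: x (D,), prev_h (H,), W* (H, D), U* (H, H).
    """
    z = sigmoid(Wz @ x + Uz @ prev_h)                 # update gate
    r = sigmoid(Wr @ x + Ur @ prev_h)                 # reset gate
    candidate = np.tanh(Wc @ x + Uc @ (r * prev_h))   # proposed new state
    # One gate does both jobs: (1 - z) keeps old state, z lets new content in.
    next_h = (1.0 - z) * prev_h + z * candidate
    return next_h
```

Unlike the LSTM sketch, there is no separate cell state here: the single state `h` carries everything to the next time-step.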
