Recurrent Neural Networks

Introduction

In the feed-forward neural networks we saw before, the output was a function of the current input and a set of weights. In recurrent neural networks (RNN), the previous network state also influences the output, so RNNs have a "notion of time". This is achieved by looping the layer output back into its input.

Recurrent neural networks are well suited to regression over sequences, because they take past values into account.

RNNs are computationally "Turing complete", which means that, with the correct set of weights, they can compute anything; imagine these weights as a program. Just to not let you get too overconfident about RNNs: there is no automatic back-propagation algorithm that will find this "perfect" set of weights for you.

Use cases of recurrent neural networks

  • Machine translation (English --> French)

  • Speech to text

  • Market prediction

  • Scene labelling (Combined with CNN)

  • Car wheel steering (Combined with CNN)

Implementing a vanilla RNN in Python

Below is a simple implementation of the RNN recurrence function (vanilla version):

\Large h_t=f_{weights}(h_{t-1},x_t) \therefore \\ \Large h_t=\tanh(W_{hh} \cdot h_{t-1} + W_{xh} \cdot x_t) \\ \Large y_t = W_{hy} \cdot h_t

The code that calculates the next state $h_t$ looks like this:

import numpy as np

def rnn_step_forward(x, prev_h, Wx, Wh, b):
  # We split the forward pass into steps to make the backpropagation easier
  # step 1
  xWx = np.dot(x, Wx)

  # step 2
  phWh = np.dot(prev_h,Wh)

  # step 3
  # total
  affine = xWx + phWh + b.T

  # step 4
  next_h = np.tanh(affine)

  # Cache inputs, state, and weights
  # we use prev_h.copy() since Python passes arrays by reference
  cache = (x, prev_h.copy(), Wx, Wh, next_h, affine)

  return next_h, cache

Observe that in this RNN case we are more interested in the next state $h_t$ than in the output $y_t$.

Before we continue, let's make explicit how to backpropagate through the tanh block.
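
Since $\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)$, the local gradient of the tanh block multiplied by the upstream gradient gives:

\Large \frac{\partial L}{\partial x} = (1 - \tanh^2(x)) \cdot \frac{\partial L}{\partial \tanh(x)}

This is exactly the dt = (1 - np.square(np.tanh(affine))) * dnext_h line in the backward code below.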

Now we can do the backpropagation step (for a single time-step):

def rnn_step_backward(dnext_h, cache):
    (x, prev_h, Wx, Wh, next_h, affine) = cache

    #backward in step
    # step 4
    # dt delta of total
    # Gradient of tanh times dnext_h
    dt = (1 - np.square(np.tanh(affine))) * (dnext_h)

    # step 3
    # Gradient of sum block
    dxWx = dt
    dphWh = dt
    db = np.sum(dt, axis=0)

    # step 2
    # Gradient of the mul block
    dWh = prev_h.T.dot(dphWh)
    dprev_h = Wh.dot(dphWh.T).T

    # step 1
    # Gradient of the mul block
    dx = dxWx.dot(Wx.T)
    dWx = x.T.dot(dxWx)

    return dx, dprev_h, dWx, dWh, db

A point to be noted is that the same function $f_{weights}$ and the same set of parameters are applied at every time-step.

A good initialization for the RNN state $h_t$ is zero. Again, this is just the initial RNN state, not its weights.
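
As a minimal sketch (the batch size N and hidden size H below are just illustrative values):

import numpy as np

N, H = 32, 128          # hypothetical batch size and hidden state size
h0 = np.zeros((N, H))   # zero initial hidden state; the weights are initialized separately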

This looping feature of RNNs can be confusing at first, but you can actually think of it as a normal neural network repeated (unrolled) multiple times. The number of times that you unroll determines how far into the past the network can remember. In other words, each repetition is a time-step.

Forward and backward propagation on each time-step

In the previous examples we presented code for forward and backward propagation for one time-step only. As presented before, the RNN is unrolled for a finite number of time-steps. Now we present how to do the forward and backward propagation over the whole sequence.

def rnn_forward(x, h0, Wx, Wh, b):
  """
  Run a vanilla RNN forward on an entire sequence of data. We assume an input
  sequence composed of T vectors, each of dimension D. The RNN uses a hidden
  size of H, and we work over a minibatch containing N sequences. After running
  the RNN forward, we return the hidden states for all timesteps.

  Inputs:
  - x: Input data for the entire timeseries, of shape (N, T, D).
  - h0: Initial hidden state, of shape (N, H)
  - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
  - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
  - b: Biases of shape (H,)

  Returns a tuple of:
  - h: Hidden states for the entire timeseries, of shape (N, T, H).
  - cache: Values needed in the backward pass
  """

  # Get shapes
  N, T, D = x.shape
  # Initialization
  h, cache = None, None
  H = h0.shape[1]
  h = np.zeros((N,T,H))

  # Keep the initial state in the last slot: at t=0 the loop reads h[:,t-1,:] = h[:,-1,:],
  # and this slot is overwritten at the final time-step
  h[:,-1,:] = h0
  cache = []

  # For each time-step
  for t in range(T):
    h[:,t,:], cache_step = rnn_step_forward(x[:,t,:], h[:,t-1,:], Wx, Wh, b)
    cache.append(cache_step)

  # Return current state and cache
  return h, cache

Now the backward propagation over the whole sequence. Note that the weight gradients are summed across time-steps, since the same weights are reused at every step:

def rnn_backward(dh, cache):
  """
  Compute the backward pass for a vanilla RNN over an entire sequence of data.

  Inputs:
  - dh: Upstream gradients of all hidden states, of shape (N, T, H)

  Returns a tuple of:
  - dx: Gradient of inputs, of shape (N, T, D)
  - dh0: Gradient of initial hidden state, of shape (N, H)
  - dWx: Gradient of input-to-hidden weights, of shape (D, H)
  - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
  - db: Gradient of biases, of shape (H,)
  """
  dx, dh0, dWx, dWh, db = None, None, None, None, None
  # Get shapes
  N,T,H = dh.shape
  D = cache[0][0].shape[1] # D taken from x in cache

  # Initialization keeping the gradients with the same shape it's respective inputs/weights
  dx, dprev_h = np.zeros((N, T, D)),np.zeros((N, H))
  dWx, dWh, db = np.zeros((D, H)), np.zeros((H, H)), np.zeros((H,))
  dh = dh.copy()

  # For each time-step
  for t in reversed(range(T)):
    dh[:,t,:]  += dprev_h # updating the previous layer dh
    dx_, dprev_h, dWx_, dWh_, db_ = rnn_step_backward(dh[:,t,:], cache[t])
    # Observe that we sum each time-step gradient
    dx[:,t,:] += dx_
    dWx += dWx_
    dWh += dWh_
    db += db_

  dh0 = dprev_h

  return dx, dh0, dWx, dWh, db
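
As a quick usage sketch of the two functions above (the shapes are arbitrary, chosen only to illustrate the expected dimensions):

import numpy as np

N, T, D, H = 2, 3, 4, 5                    # batch, time-steps, input size, hidden size
x  = np.random.randn(N, T, D)
h0 = np.random.randn(N, H)
Wx = np.random.randn(D, H) * 0.01
Wh = np.random.randn(H, H) * 0.01
b  = np.zeros(H)

h, cache = rnn_forward(x, h0, Wx, Wh, b)   # hidden states, shape (N, T, H)
dh = np.random.randn(*h.shape)             # pretend upstream gradient from the loss
dx, dh0, dWx, dWh, db = rnn_backward(dh, cache)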

Below we show a diagram that presents the multiple ways you can use a recurrent neural network, compared to forward networks. Consider the inputs to be the red blocks and the outputs the blue blocks.

  • One to one: Normal forward network, e.g. image on the input, label on the output

  • One to many (RNN): (Image captioning) Image in, words describing the scene out (CNN regions detected + RNN)

  • Many to one (RNN): (Sentiment analysis) Words of a phrase on the input, sentiment (good/bad product) on the output

  • Many to many (RNN): (Translation) Words of an English phrase on the input, Portuguese on the output

  • Many to many (RNN): (Video classification) Video in, description of the video on the output

Stacking RNNs

Below we describe how to add "depth" to an RNN and also how to unroll RNNs to deal with time. Observe that the outputs of the RNNs are fed to deeper layers, while the state is fed forward in time to deal with past states.
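
As a minimal sketch of this stacking, assuming the rnn_forward function defined above and hypothetical per-layer weights (Wx1, Wh1, b1 for the first layer and Wx2, Wh2, b2 for the second, where Wx2 has shape (H1, H2)):

# Layer 1 runs over the input sequence; layer 2 runs over layer 1's hidden states
h1, cache1 = rnn_forward(x,  h0_layer1, Wx1, Wh1, b1)   # shape (N, T, H1)
h2, cache2 = rnn_forward(h1, h0_layer2, Wx2, Wh2, b2)   # shape (N, T, H2)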

Simple regression example

Here we present a simple case where we want the RNN to complete a word. We give the network the characters h, e, l, l, and our vocabulary is [h, e, l, o]. Observe that after we input the first 'h' the network wants to output the wrong answer (the right one is in green), but near the end, after the second 'l', it wants to output the right answer 'o'. Here the order in which the characters arrive matters.
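
As a small illustrative sketch (this one-hot encoding is an assumption about how the characters could be fed to the network, not code from the original example):

import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_idx = {c: i for i, c in enumerate(vocab)}

def one_hot(char):
    # 4-dimensional vector with a 1 at the character's index
    v = np.zeros(len(vocab))
    v[char_to_idx[char]] = 1.0
    return v

# Input sequence 'hell'; the targets would be the shifted sequence 'ello'
x_seq = np.stack([one_hot(c) for c in 'hell'])   # shape (4, 4): 4 time-steps, 4-dim input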

Describing images

If you connect a convolutional neural network to a pre-trained RNN, the RNN will be able to describe what it "sees" in the image.

Basically, we take a pre-trained CNN (e.g. VGG), take its second-to-last FC layer, and connect it to an RNN. After this you train the whole thing end-to-end.
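
A minimal sketch of this wiring, under the assumption that the CNN features have already been extracted (all names and shapes below are hypothetical, just to show how the image becomes the initial hidden state of the RNN defined earlier):

import numpy as np

N, F, H, T, D = 8, 4096, 512, 16, 256            # hypothetical shapes
cnn_features = np.random.randn(N, F)             # second-to-last FC layer of the CNN
W_proj = np.random.randn(F, H) * 0.01            # learned projection to the RNN hidden size

h0 = cnn_features.dot(W_proj)                    # image features as initial hidden state
word_vectors = np.random.randn(N, T, D)          # embedded caption words (placeholder)
Wx, Wh, b = np.random.randn(D, H) * 0.01, np.random.randn(H, H) * 0.01, np.zeros(H)
h, cache = rnn_forward(word_vectors, h0, Wx, Wh, b)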

Long Short Term Memory networks(LSTM)

LSTM provides a different recurrent formula $f_W$. It is more powerful than the vanilla RNN, due to a more complex $f_W$ that adds "residual information" to the next state instead of just transforming each state. Imagine LSTMs as the "residual" version of RNNs.

In other words, LSTMs suffer much less from vanishing gradients than normal RNNs. Remember that the plus gates distribute the gradients.

By suffering less from vanishing gradients, LSTMs can remember much further into the past; in other words, they are better at remembering long-term dependencies. So from now on, just use LSTMs when you think about RNNs.

Observe from the animation below how fast the gradients of the RNN disappear compared to the LSTM.

The vanishing gradient problem can be solved with LSTM, but another problem that can happen with all recurrent neural networks is the exploding gradient problem.

To fix the exploding gradient problem, people normally apply gradient clipping, which caps the gradients at a maximum value (or rescales them to a maximum norm).
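
A minimal sketch of gradient clipping by norm (the threshold value is just an example):

import numpy as np

def clip_gradient(grad, max_norm=5.0):
    # Rescale the gradient whenever its L2 norm exceeds max_norm
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad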

This highway for the gradients is called the cell state. So one difference compared to the vanilla RNN, which has only the hidden state flowing, is that the LSTM has both the hidden state and the cell state.

LSTM Gate

Zooming in on the LSTM gates. This view also makes it clearer how to do the backpropagation.
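
In equation form (using the same gate order as the code below: input, forget, output, and block input), the standard LSTM update is:

\Large i_t=\sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\ \Large f_t=\sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\ \Large o_t=\sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\ \Large g_t=\tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g) \\ \Large c_t = f_t \odot c_{t-1} + i_t \odot g_t \\ \Large h_t = o_t \odot \tanh(c_t)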

Code for the LSTM forward propagation for one time-step:

def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
  N,H = prev_c.shape

  #forward pass in steps
  # step 1: calculate activation vector
  a = np.dot(x, Wx) + np.dot(prev_h,Wh) + b.T

  # step 2: input gate
  a_i = sigmoid(a[:,0:H])

  # step 3: forget gate
  a_f = sigmoid(a[:,H:2*H])

  # step 4: output gate
  a_o = sigmoid(a[:,2*H:3*H])

  # step 5: block input gate
  a_g= np.tanh(a[:,3*H:4*H])

  # step 6: next cell state
  next_c = a_f * prev_c +  a_i * a_g

  # step 7: next hidden state
  next_h = a_o * np.tanh(next_c)

  # we are having *.copy() since python params are pass by reference.
  cache = (x, prev_h.copy(), prev_c.copy(), a, a_i, a_f, a_o, a_g, next_h, next_c, Wx, Wh)

  return next_h, next_c, cache
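
Note that the code above assumes a sigmoid helper, which is not built into NumPy. A minimal definition would be:

def sigmoid(x):
  # Logistic function used by the LSTM gates
  return 1.0 / (1.0 + np.exp(-x))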

Now the backward propagation for one time-step:

def lstm_step_backward(dnext_h, dnext_c, cache):
  (x, prev_h, prev_c, a, a_i, a_f, a_o, a_g, next_h, next_c, Wx, Wh) = cache
  N,H = dnext_h.shape
  da = np.zeros(a.shape)

  # step 7:
  dnext_c = dnext_c.copy()
  dnext_c += dnext_h * a_o * (1 - np.tanh(next_c) ** 2)
  da_o = np.tanh(next_c) * dnext_h

  # step 6:
  da_f    = dnext_c * prev_c
  dprev_c = dnext_c * a_f
  da_i    = dnext_c * a_g
  da_g    = dnext_c * a_i

  # step 5:
  da[:,3*H:4*H] = (1 - np.square(a_g)) * da_g

  # step 4:
  da[:,2*H:3*H] = (1 - a_o) * a_o * da_o

  # step 3:
  da[:,H:2*H] = (1 - a_f) * a_f * da_f

  # step 2:
  da[:,0:H] = (1 - a_i) * a_i * da_i

  # step 1:
  db = np.sum(da, axis=0)
  dx = da.dot(Wx.T)
  dWx = x.T.dot(da)
  dprev_h = da.dot(Wh.T)
  dWh = prev_h.T.dot(da)

  return dx, dprev_h, dprev_c, dWx, dWh, db

GRU (Gated Recurrent Unit) Cells

The GRU cell can be considered a variant of the LSTM cell (it also wants to fight vanishing gradients), but it is more computationally efficient. In this cell, the forget and input gates are merged into a single update gate.
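
For reference, the standard GRU update (following Cho et al., 2014; exact conventions vary slightly between implementations) can be written as:

\Large z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1}) \\ \Large r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1}) \\ \Large \tilde{h}_t = \tanh(W_{xh} x_t + W_{hh}(r_t \odot h_{t-1})) \\ \Large h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t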
