Taking the Coursera Deep Learning Specialization, Sequence Models course. Will post condensed notes every week as part of the review process. All material originates from the free Coursera course, taught by Andrew Ng. See deeplearning.ai for more details.

Table of Contents

Recurrent Neural Networks

Recurrent Neural Networks

Why sequence models



Motivating example. Named entity recognition.

x: Harry Potter and Hermione Granger invented a new spell.

y: [1, 1, 0, 1, 1, 0, 0, 0, 0] #is a word part of a person's name?

Positions within the input/output sequence are indexed with angle brackets, starting at 1.

$$ x^{<1>} = \text{Harry} $$

$$ y^{<3>} = 0 $$

Length of the input sequence is denoted by $T_x$. Length of the output sequence is denoted by $T_{y}$. These don’t have to be the same.

$$ T_{x} = 9 $$ $$ T_{y} = 9 $$

For multiple examples, use superscript round brackets.

For instance, the second training example, 3rd word would be represented by $x^{(2)<3>}$.

It is useful to represent each word as a numeric value.


Create a vocabulary dictionary where the words are laid out in an array, sorted alphabetically. (Dictionary sizes of up to 100,000 words are not uncommon.) The word ‘a’ could be value 1. The word ‘and’ could be 367. This allows us to convert our sentence into a matrix of numbers.

Each word is represented as a one-hot vector. One-hot means only one element is set to 1; everything else is 0.
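A minimal sketch of the one-hot representation, using a made-up toy vocabulary (a real dictionary would have 10,000+ entries):

```python
import numpy as np

# Toy vocabulary for illustration; real vocabularies hold 10,000-100,000 words.
vocab = ["a", "and", "granger", "harry", "invented", "new", "potter", "spell"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, vocab_size=len(vocab)):
    """Column vector with a 1 at the word's index and 0 everywhere else."""
    vec = np.zeros((vocab_size, 1))
    vec[word_to_index[word]] = 1
    return vec

x_1 = one_hot("harry")  # one-hot representation of x^<1>
```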

Recurrent Neural Network Model

Why not a standard neural network?

- Inputs and outputs can have different lengths in different examples.
- A naive neural network does not share features learned across different positions in the text.
- In a convolutional neural network, features are shared throughout the image, but this is less useful when ordering matters (i.e. time).

Recurrent neural networks are networks where the activations calculated for the first word/sequence item are passed on to the second word/sequence item, and so on.


One weakness of this model (a unidirectional recurrent neural network) is that it doesn’t use future sequence items when interpreting earlier items.


Forward propagation steps:

$$ a^{<t>} = g(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a) $$ $$ \hat{y}^{<t>} = g(W_{ya}a^{<t>} + b_y) $$

This can be simplified to:

$$ a^{<t>} = g(W_{a} [a^{<t-1>}, x^{<t>}] + b_a ) $$ $$ \hat{y}^{<t>} = g(W_{y}a^{<t>} + b_{y}) $$
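As a sketch, one forward step of the basic RNN cell in numpy, assuming tanh for the hidden activation, a sigmoid output (as in the binary name-tagging example), and made-up toy dimensions:

```python
import numpy as np

def rnn_cell_forward(x_t, a_prev, W_aa, W_ax, W_ya, b_a, b_y):
    """Compute a^<t> from a^<t-1> and x^<t>, then the prediction y_hat^<t>."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    y_hat_t = 1 / (1 + np.exp(-(W_ya @ a_t + b_y)))  # sigmoid output
    return a_t, y_hat_t

# Toy dimensions (assumed): n_x = 8 inputs, n_a = 5 hidden units, n_y = 1 output.
n_x, n_a, n_y = 8, 5, 1
rng = np.random.default_rng(0)
W_aa, W_ax = rng.standard_normal((n_a, n_a)), rng.standard_normal((n_a, n_x))
W_ya, b_a, b_y = rng.standard_normal((n_y, n_a)), np.zeros((n_a, 1)), np.zeros((n_y, 1))
a_1, y_hat_1 = rnn_cell_forward(rng.standard_normal((n_x, 1)), np.zeros((n_a, 1)),
                                W_aa, W_ax, W_ya, b_a, b_y)
```

The initial activation $a^{<0>}$ is typically a vector of zeros.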


Backpropagation through time

Recall forward propagation. (Figure: forward propagation RNN graph.)

Loss function for a particular element in the sequence:

$$ \mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -y^{<t>} \log{\hat{y}^{<t>}} - (1 - y^{<t>}) \log{(1-\hat{y}^{<t>})} $$

Loss function for the entire sequence.

$$ \mathcal{L}(\hat{y}, y) = \sum\limits_{t=1}^{T_y} \mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>}) $$
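A small sketch of the summed cross-entropy loss, assuming binary per-step labels as in the named-entity example:

```python
import numpy as np

def sequence_loss(y_hat, y):
    """Sum the per-timestep cross-entropy losses over the whole sequence.

    y_hat, y: arrays of shape (T_y,), predictions in (0, 1), labels in {0, 1}.
    """
    per_step = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)
    return per_step.sum()

# Illustrative predictions against the "Harry Potter ..." example labels.
y_hat = np.array([0.9, 0.8, 0.1, 0.7, 0.95, 0.2, 0.1, 0.05, 0.1])
y     = np.array([1.0, 1.0, 0.0, 1.0, 1.0,  0.0, 0.0, 0.0,  0.0])
print(sequence_loss(y_hat, y))
```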

This procedure is called backpropagation through time. (Figure: backpropagation through time.)

Different types of RNNs

So far, the example shown had $T_x = T_y$. This is not always the case.

Inspired by: The Unreasonable Effectiveness of Recurrent Neural Networks




Language model and sequence generation

What is language modelling? How does a machine tell the difference between “The apple and pair salad” and “The apple and pear salad”, which sound the same when spoken?

A language model estimates the probability of a sequence of words. The training set requires a large corpus of English text.

Tokenize each sentence and turn the tokens into one-hot vectors. Another common step is to model the end of sentences with an <EOS> token. (Figure: language modelling with an RNN.)

The RNN model is trying to determine the next item in the sequence given all of the items provided in the sequence earlier.


Sampling novel sequences

Word-level language model (figure: generation of a random sentence); a minimal sampling sketch is shown after the list below.

Character-level language model (figure: generation of a random sentence, character level):

- No unknown word token is needed.
- More computationally expensive, and it is harder to capture longer-term patterns.
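As a rough sketch, the word-level sampling loop: feed a zero vector in first, sample a word from the softmax output, then feed that word back in as the next input until an <EOS> token (or a length cap). The parameters here are random and purely illustrative, and the <EOS> index is assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_a = 8, 5  # toy sizes; a real vocabulary is far larger

# Hypothetical, randomly initialised parameters so the sketch runs end to end.
W_aa, W_ax = rng.standard_normal((n_a, n_a)), rng.standard_normal((n_a, vocab_size))
W_ya = rng.standard_normal((vocab_size, n_a))
b_a, b_y = np.zeros((n_a, 1)), np.zeros((vocab_size, 1))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample(max_len=20, eos_index=0):
    """Sample word indices one at a time, feeding each choice back in as input."""
    a, x = np.zeros((n_a, 1)), np.zeros((vocab_size, 1))
    indices = []
    for _ in range(max_len):
        a = np.tanh(W_aa @ a + W_ax @ x + b_a)
        y_hat = softmax(W_ya @ a + b_y)            # distribution over the vocabulary
        idx = rng.choice(vocab_size, p=y_hat.ravel())
        indices.append(int(idx))
        if idx == eos_index:                       # assumed index of the <EOS> token
            break
        x = np.zeros((vocab_size, 1))
        x[idx] = 1                                 # one-hot of the sampled word
    return indices

print(sample())
```

A character-level model uses the same loop with characters, rather than words, as the vocabulary.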

Vanishing gradients with RNNs

One of the problems of the basic RNN is the vanishing gradient problem.

Consider the following sentences:

- The cat, which already ate ten apples and three pears, was full.
- The cats, which already ate ten apples and three pears, were full.

How do you capture the long-term dependency of cat -> was and cats -> were? The material in the middle can be arbitrarily long, and it is difficult for an item in the sequence to be influenced by values much earlier/later in the sequence.

(Figure: vanishing gradients with RNNs.) Apply gradient clipping if your gradients start to explode. (Figure: gradient clipping.)
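A minimal sketch of gradient clipping, simply capping every gradient element to a fixed range (the threshold value here is arbitrary):

```python
import numpy as np

def clip_gradients(gradients, max_abs=5.0):
    """Clip every gradient element into [-max_abs, max_abs] to tame exploding gradients."""
    return {name: np.clip(g, -max_abs, max_abs) for name, g in gradients.items()}

# Hypothetical gradient dictionary with a couple of exploding entries.
grads = {"dW_aa": np.array([[100.0, -3.0], [0.5, -250.0]]),
         "db_a": np.array([[1.0], [-7.0]])}
print(clip_gradients(grads))
```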

Gated Recurrent Unit (GRU)

An improvement to the RNN that helps capture long-term dependencies. For reference, this is the basic recurrent neural network unit.


GRU (simplified) has a memory cell.

$$ c = \text{memory cell} $$

$$ c^{<t>} = a^{<t>} $$

$$ \tilde{c}^{<t>} = \tanh({W_c [c^{<t-1>}, x^{<t>}] + b_c}) $$

The candidate new memory cell value.

$$ \Gamma_u = \sigma({W_u [c^{<t-1>}, x^{<t>}] + b_u}) $$

The update gate, which determines whether the memory cell should be updated. The ‘u’ stands for update; the capital Gamma stands for gate.

$$ c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>} $$

The new memory cell value.

(Figure: simplified GRU.) This helps the neural network learn very long-term dependencies, because $\Gamma_u$ tends to be either close to 0 or close to 1.
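A minimal sketch of one step of the simplified GRU above (toy dimensions, and no relevance gate yet):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_cell_forward(x_t, c_prev, W_c, W_u, b_c, b_u):
    """One simplified GRU step: the update gate decides whether to overwrite the memory cell."""
    concat = np.vstack([c_prev, x_t])                 # [c^<t-1>, x^<t>]
    c_tilde = np.tanh(W_c @ concat + b_c)             # candidate memory cell value
    gamma_u = sigmoid(W_u @ concat + b_u)             # update gate, mostly near 0 or 1
    c_t = gamma_u * c_tilde + (1 - gamma_u) * c_prev  # keep the old value where the gate is ~0
    return c_t                                        # a^<t> = c^<t> in the simplified GRU

# Toy dimensions (assumed): n_x = 8 inputs, n_c = 5 memory cell units.
n_x, n_c = 8, 5
rng = np.random.default_rng(0)
W_c, W_u = rng.standard_normal((n_c, n_c + n_x)), rng.standard_normal((n_c, n_c + n_x))
c_1 = gru_cell_forward(rng.standard_normal((n_x, 1)), np.zeros((n_c, 1)),
                       W_c, W_u, np.zeros((n_c, 1)), np.zeros((n_c, 1)))
```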

There is an additional relevance gate, $\Gamma_r$, which determines how relevant $c^{<t-1>}$ is when computing the candidate value $\tilde{c}^{<t>}$.


Long Short Term Memory (LSTM)


LSTM Functions

$$ \tilde{c}^{<t>} = \tanh{(W_c [a^{<t-1>}, x^{<t>}] + b_c)} $$

$$ \Gamma_u = \sigma(W_u [a^{<t-1>}, x^{<t>}] + b_u) $$

$$ \Gamma_f = \sigma(W_f [a^{<t-1>}, x^{<t>}] + b_f) $$

$$ \Gamma_o = \sigma(W_o [a^{<t-1>}, x^{<t>}] + b_o) $$

$$ c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>} $$

$$ a^{<t>} = \Gamma_o * \tanh{(c^{<t>})} $$
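A sketch of one LSTM step following the equations above, with made-up toy dimensions:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, W_c, W_u, W_f, W_o, b_c, b_u, b_f, b_o):
    """One LSTM step with separate update, forget, and output gates."""
    concat = np.vstack([a_prev, x_t])           # [a^<t-1>, x^<t>]
    c_tilde = np.tanh(W_c @ concat + b_c)       # candidate memory cell value
    gamma_u = sigmoid(W_u @ concat + b_u)       # update gate
    gamma_f = sigmoid(W_f @ concat + b_f)       # forget gate
    gamma_o = sigmoid(W_o @ concat + b_o)       # output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev  # new memory cell value
    a_t = gamma_o * np.tanh(c_t)                # new activation
    return a_t, c_t

# Toy dimensions (assumed): n_x = 8 inputs, n_a = 5 units.
n_x, n_a = 8, 5
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((n_a, n_a + n_x)) for _ in range(4)]
bs = [np.zeros((n_a, 1)) for _ in range(4)]
a_1, c_1 = lstm_cell_forward(rng.standard_normal((n_x, 1)), np.zeros((n_a, 1)),
                             np.zeros((n_a, 1)), *Ws, *bs)
```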

The LSTM is similar to the GRU, but there are a few notable differences: the LSTM keeps the activation $a^{<t>}$ separate from the memory cell $c^{<t>}$, and it uses a dedicated forget gate $\Gamma_f$ instead of $1 - \Gamma_u$, plus an output gate $\Gamma_o$.


There isn’t a widespread consensus on when to use a GRU and when to use an LSTM; neither algorithm is universally superior. The GRU is computationally simpler, while the LSTM is more powerful and flexible.

Bidirectional RNN

Bidirectional RNNs allow you to take information from both earlier and later in the sequence.


Forward propagation is run once over the sequence from beginning to end and, simultaneously, once over the sequence from end to beginning.

The prediction at each position combines the activations from both directions: $\hat{y}^{<t>} = g(W_{y} [\overrightarrow{a}^{<t>}, \overleftarrow{a}^{<t>}] + b_y)$.
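A sketch of the prediction at a single position, assuming the forward and backward activations have already been computed by the two passes:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def brnn_predict(a_fwd, a_bwd, W_y, b_y):
    """Prediction at one time step from the forward and backward activations."""
    concat = np.vstack([a_fwd, a_bwd])  # [a_forward^<t>, a_backward^<t>]
    return sigmoid(W_y @ concat + b_y)

# Toy sizes (assumed): n_a = 5 units per direction, n_y = 1 output.
n_a, n_y = 5, 1
rng = np.random.default_rng(0)
y_hat_t = brnn_predict(rng.standard_normal((n_a, 1)), rng.standard_normal((n_a, 1)),
                       rng.standard_normal((n_y, 2 * n_a)), np.zeros((n_y, 1)))
```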


A disadvantage is that the computation is roughly doubled, and you need the entire sequence before you can make a prediction. For speech recognition, this means waiting until the person stops talking before you can make a prediction.

Deep RNNs

Added notation: a square-bracket superscript represents the layer number, so $a^{[2]<3>}$ is the activation of layer 2 at time step 3.


Recurrent neural networks can be stacked on top of one another. Three layers is usually plenty.
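A rough sketch of stacking, where layer $l$ at time $t$ sees its own previous activation and the activation of the layer below at the same time step (toy shapes, random parameters):

```python
import numpy as np

def deep_rnn_forward(x_seq, Ws, bs, n_a):
    """Stacked RNN forward pass: layer l at time t uses a[l]^<t-1> and a[l-1]^<t>."""
    n_layers = len(Ws)
    a = [np.zeros((n_a, 1)) for _ in range(n_layers)]  # a[l]^<0> initialised to zeros
    outputs = []
    for x_t in x_seq:
        below = x_t                                     # layer 1's input is x^<t>
        for l in range(n_layers):
            a[l] = np.tanh(Ws[l] @ np.vstack([a[l], below]) + bs[l])
            below = a[l]                                # feeds the layer above
        outputs.append(below)                           # top-layer activation at time t
    return outputs

# Toy setup (assumed): 3 layers, n_x = 8 inputs, n_a = 5 units per layer, T = 4 steps.
n_x, n_a, T = 8, 5, 4
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((n_a, n_a + (n_x if l == 0 else n_a))) for l in range(3)]
bs = [np.zeros((n_a, 1)) for _ in range(3)]
outputs = deep_rnn_forward([rng.standard_normal((n_x, 1)) for _ in range(T)], Ws, bs, n_a)
```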