Taking the Coursera Deep Learning Specialization, Sequence Models course. Will post condensed notes every week as part of the review process. All material originates from the free Coursera course, taught by Andrew Ng. See deeplearning.ai for more details.

# Recurrent Neural Networks

• Learn about recurrent neural networks.
• This type of model has been proven to perform well on temporal data (data involving time)
• Several variants:
• LSTM
• GRU
• Bidirectional RNN

## Recurrent Neural Networks

### Why sequence models

• In speech recognition, you are given an audio clip and asked to output a sentence.
• In music generation, you are trying to output a sequence of notes.
• In sentiment classification, you are given a sentence and try to determine a rating or sentiment for it (happy/sad, etc.).
• In DNA sequence analysis, DNA is represented by the letters A, G, C and T, and you can use ML to label whether or not a sequence represents a protein.
• Machine translation; one sentence to another sentence
• etc.

### Notation

Motivating example. Named entity recognition.

x: Harry Potter and Hermione Granger invented a new spell.

y: [1, 1, 0, 1, 1, 0, 0, 0, 0] #is a word part of a person's name?

Angle brackets index a position in the input or output sequence. Indexing starts at 1.

$$x^{<1>} = \text{Harry}$$

$$y^{<3>} = 0$$

Length of the input sequence is denoted by $T_x$. Length of the output sequence is denoted by $T_{y}$. These don’t have to be the same.

$$T_{x} = 9$$ $$T_{y} = 9$$

For multiple examples, use superscript round brackets.

For instance, the second training example, 3rd word would be represented by $x^{(2)<3>}$.

It is useful to represent each word as a number.

Create a vocabulary (dictionary) in which the words are laid out in an array, sorted from A to Z. (Vocabulary sizes of up to 100,000 are not uncommon.) The word ‘a’ could be index 1. The word ‘and’ could be index 367. This allows us to convert our sentence into a matrix of numbers.

Each word is represented as a one-hot vector whose length is the vocabulary size. One-hot means only the entry at the word’s index is set to 1; everything else is 0.
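The one-hot representation described above can be sketched as follows. This is a minimal illustration, not the course's code: the nine-word `vocab` is a hypothetical toy vocabulary built from the example sentence, and it uses 0-based indices (as NumPy does) rather than the 1-based indices of the notation.

```python
import numpy as np

# Hypothetical toy vocabulary, sorted alphabetically; a real one is far larger.
vocab = ["a", "and", "granger", "harry", "hermione",
         "invented", "new", "potter", "spell"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, word_to_index):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(word_to_index))
    vec[word_to_index[word]] = 1.0
    return vec

sentence = "harry potter and hermione granger invented a new spell".split()

# One row per word: shape (T_x, vocabulary size).
X = np.stack([one_hot(word, word_to_index) for word in sentence])
```

Each row of `X` plays the role of one $x^{<t>}$.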

### Recurrent Neural Network Model

Why not a standard neural network?

• Inputs and outputs can be different lengths in different examples.
• A naive neural network does not share features learned across different positions of text.
• In a convolutional neural network, features are shared throughout the image, but this is less useful when ordering is important (i.e. time).

Recurrent neural networks are networks where the activations calculated from the first word/sequence item are passed onto the second word/sequence item.

One weakness of this model (a one directional recurrent neural network) is that the prediction for a sequence item uses only the earlier items in the sequence, not the later ones.

Forward propagation steps:

$$a^{<t>} = g(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a)$$ $$\hat{y}^{<t>} = g(W_{ya}a^{<t>} + b_y)$$

this can be simplified to:

$$a^{<t>} = g(W_{a} [a^{<t-1>}, x^{<t>}] + b_a )$$ $$\hat{y}^{<t>} = g(W_{y}a^{<t>} + b_{y})$$
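A minimal NumPy sketch of the forward propagation equations above. It assumes $\tanh$ for the hidden activation and a sigmoid for the output, which suits a binary label such as the named entity example; the layer sizes and random weights are hypothetical. The second function uses the stacked form, with $W_a$ built by placing $W_{aa}$ and $W_{ax}$ side by side.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(a_prev, x_t, W_aa, W_ax, W_ya, b_a, b_y):
    """One time step, written with separate W_aa and W_ax."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    y_hat_t = sigmoid(W_ya @ a_t + b_y)
    return a_t, y_hat_t

def rnn_step_stacked(a_prev, x_t, W_a, W_y, b_a, b_y):
    """The same step using the simplified form W_a [a_prev, x_t]."""
    a_t = np.tanh(W_a @ np.concatenate([a_prev, x_t]) + b_a)
    y_hat_t = sigmoid(W_y @ a_t + b_y)
    return a_t, y_hat_t

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
n_a, n_x, n_y, T_x = 4, 9, 1, 9
W_aa = rng.normal(size=(n_a, n_a)) * 0.1
W_ax = rng.normal(size=(n_a, n_x)) * 0.1
W_ya = rng.normal(size=(n_y, n_a)) * 0.1
b_a, b_y = np.zeros(n_a), np.zeros(n_y)
W_a = np.hstack([W_aa, W_ax])  # shape (n_a, n_a + n_x)

xs = rng.normal(size=(T_x, n_x))
a = np.zeros(n_a)  # initial activation a^{<0>}
y_hats = []
for x_t in xs:
    # The activation from the previous item is passed on to the next one.
    a, y_hat_t = rnn_step(a, x_t, W_aa, W_ax, W_ya, b_a, b_y)
    y_hats.append(y_hat_t)
y_hats = np.array(y_hats)  # shape (T_x, n_y)
```

The same weights are reused at every time step, which is how features are shared across positions.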

### Backpropagation through time

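Recall forward propagation: each time step $t$ produces a prediction $\hat{y}^{<t>}$, which is compared against the label $y^{<t>}$.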

Loss function for a particular element in the sequence:

$$\mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -y^{<t>} \log{\hat{y}^{<t>}} - (1 - y^{<t>}) \log{(1-\hat{y}^{<t>})}$$

Loss function for the entire sequence.

$$\mathcal{L}(\hat{y}, y) = \sum\limits_{t=1}^{T_y} \mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>})$$

Propagating the gradient of this loss backwards, from the last time step to the first, is called backpropagation through time.
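The two loss functions above can be sketched in a few lines. The predictions `y_hat` below are made-up values chosen for illustration, not the output of a trained model.

```python
import numpy as np

def step_loss(y_hat_t, y_t):
    """Cross-entropy loss for a single element of the sequence."""
    return -y_t * np.log(y_hat_t) - (1 - y_t) * np.log(1 - y_hat_t)

def sequence_loss(y_hat, y):
    """Sum of the per-step losses over t = 1..T_y."""
    return sum(step_loss(p, t) for p, t in zip(y_hat, y))

# Labels from the named entity example; predictions are hypothetical.
y = np.array([1, 1, 0, 1, 1, 0, 0, 0, 0])
y_hat = np.where(y == 1, 0.9, 0.1)

total = sequence_loss(y_hat, y)
```

Here every prediction is off by 0.1 from its label, so each step contributes $-\log 0.9$ and the total is nine times that.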

### Different types of RNNs

So far, the example shown had $T_x == T_y$. This is not always the case.

• Many to Many, where the input length and output length are the same.
• Many to One, where for instance you are trying to determine the rating of a piece of text (whether a sentence is happy or not, for instance).
• One to One, a standard neural net (not really recurrent).
• One to Many, such as seeding a music generation neural network.
• Many to Many, where the input length and output length are different. The network is broken into two parts, an encoder and a decoder.

### Language model and sequence generation

What is language modelling? How does a machine tell the difference between “The apple and pair salad” and “The apple and pear salad”?

A language model estimates the probability of a sequence of words. The training set requires a large corpus of English text.

Tokenize each sentence, then turn each token into a ‘one hot vector’. Another common thing to do is to model the end of sentences with an <EOS> token.

The RNN model is trying to determine the next item in the sequence given all of the items provided in the sequence earlier.
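Because each step predicts the next item given the earlier ones, the probability of a whole sentence is the product of the per-step conditional probabilities. A minimal sketch, assuming the model's per-step output distributions are already available; the three-token vocabulary and the numbers in `step_probs` are hypothetical.

```python
import numpy as np

def sentence_log_prob(step_probs, token_ids):
    """Sum of log P(y^{<t>} | y^{<1>}, ..., y^{<t-1>}) over the sequence."""
    return sum(np.log(step_probs[t][tok]) for t, tok in enumerate(token_ids))

# Row t is the (made-up) distribution the model outputs at step t.
step_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])
token_ids = [0, 1, 2]  # the tokens that actually occur in the sentence

log_p = sentence_log_prob(step_probs, token_ids)
p = np.exp(log_p)  # equals 0.7 * 0.6 * 0.6
```

Summing log probabilities rather than multiplying raw probabilities avoids numerical underflow on long sentences.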

### Sampling novel sequences

Word level language model

Character level language model

• No unknown word token.
• More computationally expensive, and more difficult to capture longer term patterns.
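The notes do not spell out the sampling procedure itself. A minimal sketch, assuming the usual approach: sample a token from the output distribution at each step, feed it back as the next input, and stop at the end-of-sentence token. The weights are random rather than trained, and `eos_id`, `max_len` and the sizes are hypothetical, so the output is meaningless; the same loop applies to word level and character level models.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_sequence(W_aa, W_ax, W_ya, b_a, b_y, eos_id, max_len, rng):
    """Sample token ids one at a time from a basic RNN language model."""
    n_a, vocab_size = W_ax.shape
    a = np.zeros(n_a)
    x = np.zeros(vocab_size)  # the first input is a vector of zeros
    tokens = []
    for _ in range(max_len):
        a = np.tanh(W_aa @ a + W_ax @ x + b_a)
        probs = softmax(W_ya @ a + b_y)
        tok = int(rng.choice(vocab_size, p=probs))  # sample, do not argmax
        tokens.append(tok)
        if tok == eos_id:
            break
        x = np.zeros(vocab_size)  # one-hot of the sampled token
        x[tok] = 1.0
    return tokens

# Hypothetical sizes and untrained random weights.
rng = np.random.default_rng(0)
n_a, vocab_size, eos_id, max_len = 8, 5, 4, 20
W_aa = rng.normal(size=(n_a, n_a)) * 0.1
W_ax = rng.normal(size=(n_a, vocab_size)) * 0.1
W_ya = rng.normal(size=(vocab_size, n_a)) * 0.1
b_a, b_y = np.zeros(n_a), np.zeros(vocab_size)

tokens = sample_sequence(W_aa, W_ax, W_ya, b_a, b_y, eos_id, max_len, rng)
```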

One of the problems of the basic RNN is the vanishing gradient problem.

Consider the following sentences:

• The cat, which already ate ten apples and three pears, was full.
• The cats, which already ate ten apples and three pears, were full.

How do you capture the long term dependency of cat -> was and cats -> were? The stuff in the middle can be arbitrarily long. It is difficult for an item in the sequence to be influenced by values much earlier/later in the sequence.

### Gated Recurrent Unit (GRU)

The GRU is an improvement to the basic RNN that helps capture long term dependencies. For reference, the basic recurrent neural network unit computes $a^{<t>} = g(W_{a} [a^{<t-1>}, x^{<t>}] + b_a)$.

GRU (simplified) has a memory cell.

$$c = \text{memory cell}$$

$$c^{<t>} = a^{<t>}$$

The candidate new memory cell value:

$$\tilde{c}^{<t>} = \tanh(W_c [c^{<t-1>}, x^{<t>}] + b_c)$$

The update gate, which determines whether the memory cell should be updated or not. The ‘u’ stands for update. Capital Gamma stands for Gate.

$$\Gamma_u = \sigma(W_u [c^{<t-1>}, x^{<t>}] + b_u)$$

The new memory cell value:

$$c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$$

This helps the neural network learn very long term dependencies. Because $\Gamma_u$ is either close to 0 or close to 1, the memory cell can carry its value almost unchanged across many steps whenever $\Gamma_u$ is close to 0.

The full GRU has an additional ‘gate’, the relevance gate $\Gamma_r$, which determines how relevant $c^{<t-1>}$ is when computing the candidate value $\tilde{c}^{<t>}$.
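The simplified GRU equations above (without the additional relevance gate) can be sketched as a single step function. The sizes and weights are hypothetical. The update gate's bias is set to a large negative value on purpose, to show how a gate close to 0 preserves the memory cell across many steps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, W_c, W_u, b_c, b_u):
    """One step of the simplified GRU (update gate only)."""
    concat = np.concatenate([c_prev, x_t])
    c_tilde = np.tanh(W_c @ concat + b_c)              # candidate value
    gamma_u = sigmoid(W_u @ concat + b_u)              # update gate
    c_t = gamma_u * c_tilde + (1 - gamma_u) * c_prev   # new memory cell
    return c_t, gamma_u

rng = np.random.default_rng(0)
n_c, n_x = 3, 2
W_c = rng.normal(size=(n_c, n_c + n_x))
b_c = np.zeros(n_c)
W_u = np.zeros((n_c, n_c + n_x))
b_u = np.full(n_c, -20.0)  # forces the update gate to be nearly closed

c_start = np.array([0.5, -0.3, 0.8])
c = c_start.copy()
for _ in range(100):
    # Random inputs stand in for "the stuff in the middle" of a sentence.
    c, gamma_u = gru_step(c, rng.normal(size=n_x), W_c, W_u, b_c, b_u)
```

After 100 steps `c` is still essentially equal to `c_start`, because each step mixes in only a tiny fraction of the candidate value.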

### Long Short Term Memory (LSTM)

LSTM Functions

$$\tilde{c}^{<t>} = \tanh(W_c [a^{<t-1>}, x^{<t>}] + b_c)$$

$$\Gamma_u = \sigma(W_u [a^{<t-1>}, x^{<t>}] + b_u)$$

$$\Gamma_f = \sigma(W_f [a^{<t-1>}, x^{<t>}] + b_f)$$

$$\Gamma_o = \sigma(W_o [a^{<t-1>}, x^{<t>}] + b_o)$$

$$c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}$$

$$a^{<t>} = \Gamma_o * \tanh(c^{<t>})$$

The LSTM is similar to GRU, but there are a few notable differences.

• The LSTM has three gates: an Update Gate, a Forget Gate, and an Output Gate.
• The LSTM does not equate $a^{<t>} == c^{<t>}$; the activation and the memory cell are tracked separately.
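A minimal sketch of one LSTM step under the equations in this section. It assumes the standard formulation in which $\tanh$ is applied to the memory cell before the output gate. The parameter dictionary layout, sizes and random weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, p):
    """One LSTM step; p maps names like 'W_u' and 'b_u' to arrays."""
    concat = np.concatenate([a_prev, x_t])
    c_tilde = np.tanh(p["W_c"] @ concat + p["b_c"])   # candidate value
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])   # update gate
    gamma_f = sigmoid(p["W_f"] @ concat + p["b_f"])   # forget gate
    gamma_o = sigmoid(p["W_o"] @ concat + p["b_o"])   # output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev
    a_t = gamma_o * np.tanh(c_t)  # activation differs from the memory cell
    return a_t, c_t

def init_params(n_a, n_x, rng):
    """Small random weights and zero biases (illustrative only)."""
    p = {}
    for name in ("c", "u", "f", "o"):
        p["W_" + name] = rng.normal(size=(n_a, n_a + n_x)) * 0.1
        p["b_" + name] = np.zeros(n_a)
    return p

rng = np.random.default_rng(0)
n_a, n_x, T_x = 4, 3, 6
params = init_params(n_a, n_x, rng)

a, c = np.zeros(n_a), np.zeros(n_a)
for x_t in rng.normal(size=(T_x, n_x)):
    a, c = lstm_step(a, c, x_t, params)
```

Unlike the GRU sketch, the update and forget gates are independent, so the cell can keep its old value and add the candidate at the same time.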

There isn’t a widespread consensus as to when to use a GRU and when to use an LSTM. Neither algorithm is universally superior. The GRU is computationally simpler. The LSTM is more powerful and flexible.

### Bidirectional RNN

Bidirectional RNNs allow you to take information from both earlier and later in the sequence.

Forward propagation is run once over the sequence from beginning to end, producing activations $\overrightarrow{a}^{<t>}$. A second, independent forward propagation is run from the end of the sequence to the beginning, producing activations $\overleftarrow{a}^{<t>}$.

At each sequence item, the activation function $g$ is applied to the two blocks together:

$$\hat{y}^{<t>} = g(W_{y} [\overrightarrow{a}^{<t>}, \overleftarrow{a}^{<t>}] + b_{y})$$

The disadvantage is that the computation is now doubled. You also need the entire sequence before you can make predictions. When you are doing speech processing, you have to wait until the person stops talking before you can make a prediction.
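A minimal sketch of the bidirectional forward pass described above, using basic RNN units with $\tanh$ and a sigmoid output; a GRU or LSTM unit could be substituted. All sizes and weights are hypothetical. The last lines change only the final input and recompute the predictions, to show that the first prediction depends on later items.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_direction(xs, W_aa, W_ax, b_a):
    """Run a basic RNN over xs in the given order; one activation per step."""
    a = np.zeros(W_aa.shape[0])
    activations = []
    for x_t in xs:
        a = np.tanh(W_aa @ a + W_ax @ x_t + b_a)
        activations.append(a)
    return activations

def birnn_forward(xs, fwd, bwd, W_y, b_y):
    a_fwd = run_direction(xs, *fwd)
    # Run over the reversed sequence, then re-align with the original positions.
    a_bwd = run_direction(xs[::-1], *bwd)[::-1]
    return np.array([
        sigmoid(W_y @ np.concatenate([af, ab]) + b_y)
        for af, ab in zip(a_fwd, a_bwd)
    ])

rng = np.random.default_rng(0)
n_a, n_x, T_x = 3, 2, 3

def make_direction():
    """Random (W_aa, W_ax, b_a) for one direction; illustrative only."""
    return (rng.normal(size=(n_a, n_a)) * 0.5,
            rng.normal(size=(n_a, n_x)) * 0.5,
            np.zeros(n_a))

fwd, bwd = make_direction(), make_direction()
W_y, b_y = rng.normal(size=(1, 2 * n_a)), np.zeros(1)

xs = rng.normal(size=(T_x, n_x))
y_hat = birnn_forward(xs, fwd, bwd, W_y, b_y)

xs_changed = xs.copy()
xs_changed[-1] += 1.0  # modify only the final sequence item
y_hat_changed = birnn_forward(xs_changed, fwd, bwd, W_y, b_y)
```

The two `run_direction` calls are the doubled computation, and the backward pass cannot start until the whole sequence is available.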

### Deep RNNs

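Added notation: a square bracket superscript represents the layer number, so $a^{[l]<t>}$ is the activation of layer $l$ at time step $t$.

Recurrent neural networks can be stacked on top of one another. Because each layer is also unrolled over time, three recurrent layers is usually plenty.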
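A minimal sketch of stacking, assuming basic RNN units. Each layer takes its own previous activation and the current activation of the layer below, with the first layer taking $x^{<t>}$. The sizes and weights are hypothetical.

```python
import numpy as np

def deep_rnn_forward(xs, layers, n_a):
    """layers: list of (W_a, b_a), one pair per layer, bottom layer first."""
    a = [np.zeros(n_a) for _ in layers]  # a^{[l]<0>} for every layer l
    top = []
    for x_t in xs:
        layer_input = x_t
        for l, (W_a, b_a) in enumerate(layers):
            # a^{[l]<t>} = tanh(W_a^{[l]} [a^{[l]<t-1>}, a^{[l-1]<t>}] + b_a^{[l]})
            a[l] = np.tanh(W_a @ np.concatenate([a[l], layer_input]) + b_a)
            layer_input = a[l]
        top.append(layer_input)  # activation of the top layer at this step
    return np.array(top)

rng = np.random.default_rng(0)
n_a, n_x, T_x, n_layers = 4, 3, 5, 3
layers = []
for l in range(n_layers):
    n_in = n_x if l == 0 else n_a  # layer 0 sees x, higher layers see a^{[l-1]}
    layers.append((rng.normal(size=(n_a, n_a + n_in)) * 0.1, np.zeros(n_a)))

top_activations = deep_rnn_forward(rng.normal(size=(T_x, n_x)), layers, n_a)
```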