Taking the Coursera Deep Learning Specialization, Sequence Models course. Will post condensed notes every week as part of the review process. All material originates from the free Coursera course, taught by Andrew Ng. See deeplearning.ai for more details.

# Natural Language Processing & Word Embeddings

• Learn how to use deep learning for natural language processing.
• Use word vector representations and embedding layers to train recurrent neural networks with great performance.
• Learn to perform sentiment analysis, named entity recognition, and machine translation.

## Introduction to Word Embeddings

### Word Representation

• So far, words have been represented with a one-hot vector over a vocabulary list.
• One weakness of this representation is that it treats each word as a thing unto itself, which makes it difficult to generalize across different words.
• The inner product between two one-hot vectors is zero.
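A quick numpy sketch of why one-hot vectors carry no notion of similarity (the indices are just the course's example values for "man" and "woman"):

```python
import numpy as np

vocab_size = 10000  # 10,000-word vocabulary, as in the course's example

def one_hot(index, size):
    """Return a one-hot vector with a 1 at the given index."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

o_man = one_hot(5391, vocab_size)
o_woman = one_hot(9853, vocab_size)

# The inner product of any two distinct one-hot vectors is 0,
# so this representation says nothing about how related two words are.
print(o_man @ o_woman)   # 0.0
print(o_man @ o_man)     # 1.0
```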

It would be better if each word could be represented by features.

For instance, a word could have a gender associated with it.

The word man could have gender -1 and the word woman could have gender +1 while a word like apple could have gender 0.

Notation: $e_{5391}$, where the subscript is the word's index in the vocabulary (the same index as in its one-hot vector), but $e$ refers to the feature (embedding) vector instead of the one-hot vector.

It’s common to visualize word embeddings in a 2D plane using an algorithm like t-SNE. These representations are called embeddings because each word is mapped (embedded) to a point in a high-dimensional space. t-SNE lets you visualize that space in a lower-dimensional one.

### Using word embeddings

Named entity recognition: trying to detect people’s names in a sentence.

Word embeddings are commonly trained on very large unlabeled corpora of 1B to 100B words, while the labeled training set for your task may only be around 100K words.

This knowledge can be transferred to named entity recognition, as you can train your neural network’s word embeddings on text found on the internet.

1. Learn word embeddings from a large text corpus (1-100B words), or download pre-trained embeddings online.
2. Transfer the embedding to a new task with a smaller training set (say, 100K words). Rather than a 10,000-dimensional one-hot vector, you can now use a 300-dimensional dense vector.
3. Optional: continue to fine-tune the word embeddings with new data.
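The three steps above can be sketched as follows. This is a toy illustration, not real pretrained vectors: the 2-dimensional embeddings below are made up (real downloads, e.g. GloVe vectors, would be 50-300 dimensional).

```python
import numpy as np

# Step 1: pretend these were pretrained on a large corpus
# (stand-ins for downloaded embeddings).
pretrained = {
    "good":  np.array([0.9, 0.1]),
    "bad":   np.array([-0.8, 0.2]),
    "movie": np.array([0.1, 0.7]),
}

# Step 2: transfer -- represent a small-task sentence with the dense
# vectors instead of 10,000-dimensional one-hot vectors.
sentence = ["good", "movie"]
features = np.stack([pretrained[w] for w in sentence])
print(features.shape)   # (2, 2): one dense row per word

# Step 3 (optional): fine-tune -- treat the rows of the embedding matrix
# as trainable parameters and update them with gradients from the new task.
```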

Finally, word embeddings have an interesting relationship to face encoding. For face recognition, recall the Siamese network trained to generate an encoding of the input image’s face. The image encoding plays a similar role to a word embedding; the difference is that word embeddings are learned for a fixed-size vocabulary, whereas the face network must produce encodings for arbitrary, never-before-seen inputs.

### Properties of word embeddings

Suppose that you are given the question “Man is to woman, as King is to ?”

Is it possible to have the neural network answer this question using word embeddings? Yes. One interesting property of word embeddings is that the difference vector $e_{\text{man}} - e_{\text{woman}}$ is very close to the difference vector $e_{\text{king}} - e_{\text{queen}}$.

Your algorithm can first compute the difference vector between man and woman, then search for the word that gives a similar difference relative to king.

In pictures: the word embeddings live in, say, 300-dimensional space. The vector difference between man and woman is very similar to the vector difference between king and queen; that difference direction captures gender.

Try to find the word $w$ such that the following equation holds true. $$e_{\text{man}} - e_{\text{woman}} \approx e_{\text{king}} - e_{\text{queen}}$$

Find word w: $\text{arg} \max\limits_w \text{similarity}(e_w, e_{\text{king}} - e_{\text{man}} + e_{\text{woman}})$

If you learn a set of word embeddings, you can find analogies using word vectors with decent accuracy.

The most commonly used similarity function is Cosine similarity.

$$\text{similarity}(u, v) = \dfrac{u^Tv}{||u||_2||v||_2}$$

You can also use squared Euclidean distance, $||u-v||^2$, though it measures dissimilarity rather than similarity.
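The argmax formula above can be sketched in a few lines of numpy. The embeddings below are made-up toy vectors where the first coordinate plays the role of a "gender" feature:

```python
import numpy as np

def cosine_similarity(u, v):
    """similarity(u, v) = u.v / (||u|| * ||v||)"""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy embeddings (hypothetical values, not learned ones).
E = {
    "man":   np.array([-1.00, 0.01, 0.03]),
    "woman": np.array([ 1.00, 0.02, 0.02]),
    "king":  np.array([-0.95, 0.93, 0.70]),
    "queen": np.array([ 0.97, 0.95, 0.69]),
    "apple": np.array([ 0.00, -0.01, 0.10]),
}

# arg max_w similarity(e_w, e_king - e_man + e_woman)
target = E["king"] - E["man"] + E["woman"]
candidates = [w for w in E if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine_similarity(E[w], target))
print(best)   # queen
```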

Things it can learn:

• Man:Woman as Boy:Girl
• Big:Bigger as Tall:Taller
• Yen:Japan as Ruble:Russia

### Embedding matrix

When you implement an algorithm to learn an embedding, you’re learning an embedding matrix.

Take, for instance, a 10,000-word vocabulary: [a, aaron, orange, ... zulu, <UNK>]

Then let’s make the matrix E 300 by 10,000. If orange is indexed 6257:

$O_{6257}$ is the one-hot vector with 10,000 rows and a 1 at the 6257th position.

$$E * O_j = e_j$$

Initialize E randomly, then use gradient descent to learn the parameters of the embedding matrix.

In practice, you use a specialized lookup function rather than this multiplication, since matrix multiplication with one-hot vectors is inefficient. Keras has an Embedding layer that does this for you.
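A small numpy sketch of why the lookup trick works: multiplying $E$ by a one-hot vector just selects one column, so implementations index the column directly instead.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10000, 300
E = rng.standard_normal((dim, vocab_size))   # embedding matrix, 300 x 10,000

# What the math says: E * O_6257, with O_6257 one-hot.
o = np.zeros(vocab_size)
o[6257] = 1.0                 # "orange" at index 6257
e_slow = E @ o                # ~3M multiply-adds, almost all against zeros

# What implementations actually do: index the column directly.
e_fast = E[:, 6257]

print(np.allclose(e_slow, e_fast))   # True
```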

## Learning Word Embeddings: Word2vec & GloVe

### Learning word embeddings

Let’s say you’re building a language model. Building a neural language model is a reasonable way to learn word embeddings.

You can have various context and target pairs.

### Word2Vec

Word2Vec uses skip-grams. Let’s say you’re given the sentence “I want a glass of orange juice to go along with my cereal”. Rather than having the context be the immediately preceding words, randomly pick a word to be your context word. Then randomly pick a word within a window (plus or minus four words, say) to be your target word.

Model:

• Vocab size = 10,000 words
• Want to learn a mapping from some context c (“orange”) to some target t (“juice”)

Softmax : $$p(t|c) = \dfrac{e^{\theta_t^Te_c}}{\sum\limits_{j=1}^{10,000}e^{\theta_j^Te_c}}$$

$$\theta_t = \text{parameter associated with output t}$$

Loss function: $$\mathcal{L}(\hat{y}, y) = - \sum\limits_{i=1}^{10,000} y_i \log \hat{y}_i$$
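The softmax and loss above can be sketched in numpy (the parameters here are random stand-ins; the index for "juice" is hypothetical). Note the denominator requires summing over all 10,000 words:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, dim = 10000, 300

theta = rng.standard_normal((vocab_size, dim)) * 0.01  # one theta_t per output word
e_c = rng.standard_normal(dim) * 0.01                  # embedding of context word c

# p(t|c) = exp(theta_t . e_c) / sum_j exp(theta_j . e_c)
logits = theta @ e_c
logits -= logits.max()                    # for numerical stability
p = np.exp(logits) / np.exp(logits).sum() # the sum runs over all 10,000 words

t = 4834                                  # hypothetical index of target "juice"
loss = -np.log(p[t])                      # cross-entropy with a one-hot label

print(p.sum())   # ~1.0
```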

Problems with softmax classification:

• You need to carry out a sum over your entire vocabulary every time you want to calculate a probability.
• Solution: use a hierarchical softmax classifier. Think of it as a decision tree of binary/logistic classifiers. This scales with the log of the vocabulary size rather than linearly with it.

In practice, $P(c)$ is not sampled uniformly at random. Uniform sampling over the corpus yields mostly common words like ‘and, or, to’ etc. Instead, words are sampled in a way more likely to produce a good embedding matrix.

### Negative Sampling

The downside of the skip-gram model above is that the softmax step is slow to compute. Negative sampling is much more efficient.

Define a new learning problem

• I want a glass of orange juice to go along with my cereal.

Given a pair of words, such as orange:juice, determine whether it is a context-target pair: orange:juice returns 1, while orange:king returns 0.

Pick a valid context/target pair and label it 1, then pick K random words from the dictionary and label those pairs 0 (random words are usually not linked to the context).

Define a logistic regression model.

$$P(y=1 | c, t) = \sigma (\theta_t^Te_c)$$

For each training example you update only K + 1 binary classifiers rather than a 10,000-way softmax. This is called negative sampling because you take one positive example and then go out and generate a bunch of negative examples.

How do you choose the negative examples? After choosing the context word “orange”, one option is to sample candidate target words according to their empirical frequency in your corpus (how often each appears). The problem is this gives you lots of words like “the, of, and, …”

Empirically, what the authors found to work best:

$$p(w_i) = \dfrac{f(w_i)^{3 / 4}}{\sum\limits_{j=1}^{10000} f(w_j)^{3 / 4}}$$
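Negative sampling can be sketched as below. The word frequencies are made up for illustration; note how the $f^{3/4}$ distribution is flatter than raw frequency, so common words dominate less:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy corpus frequencies f(w) (hypothetical counts).
freq = {"the": 500, "of": 400, "orange": 20, "juice": 15, "king": 5}
words = list(freq)
f = np.array([freq[w] for w in words], dtype=float)

# p(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4)
p = f ** 0.75 / (f ** 0.75).sum()

k = 4
negatives = rng.choice(words, size=k, p=p)   # k random "negative" target words

# Each (c, t) pair gets a logistic unit: P(y=1|c,t) = sigmoid(theta_t . e_c),
# so only k + 1 binary classifiers are updated per positive example.
dim = 5
e_c = rng.standard_normal(dim)
theta_t = rng.standard_normal(dim)
print(sigmoid(theta_t @ e_c))   # a probability in (0, 1)
```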

### GloVe word vectors

GloVe stands for “global vectors for word representation”.

“I want a glass of orange juice to go along with my cereal.”

$$x_{ij} = \text{ # times i appears in context of j}$$

• How often do words appear close with each other?

Minimize:

$$\sum\limits_{i=1}^{10,000} \sum\limits_{j=1}^{10,000} f(X_{ij})(\Theta_i^Te_j + b_i + b_j' - \log{X_{ij}})^2$$

$$f(X_{ij}) = 0 \text{ if } X_{ij} = 0$$

• $f$ accounts for $0\log{0} = 0$.
• $f$ also balances the weight given to frequent words (this, is, of, a) and infrequent words (durian).
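The co-occurrence counts $X_{ij}$ that GloVe fits can be built with a simple windowed scan; a minimal sketch (window size and symmetric-context choice are assumptions, matching the course's ±4 example):

```python
from collections import defaultdict

sentence = "I want a glass of orange juice to go along with my cereal".lower().split()
window = 4   # count words within +/- 4 positions as "in the context of" word i

# X[i][j] = number of times word j appears in the context of word i
X = defaultdict(lambda: defaultdict(int))
for idx, wi in enumerate(sentence):
    lo, hi = max(0, idx - window), min(len(sentence), idx + window + 1)
    for jdx in range(lo, hi):
        if jdx != idx:
            X[wi][sentence[jdx]] += 1

# GloVe then fits theta_i . e_j + b_i + b_j' to log(X_ij), weighted by f(X_ij).
print(X["orange"]["juice"])   # 1
```

With a symmetric window, $X_{ij}$ is symmetric: "juice" in the context of "orange" is counted exactly as often as the reverse.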

Note on the featurization view of word embeddings

• Features learned by these algorithms do not neatly translate to interpretable features like ‘gender’ or ‘royal’.

## Applications using Word Embeddings

### Sentiment Classification

The task of looking at a piece of text and telling whether the text is “liked” or “disliked”.

• You may not have a huge labeled dataset.

Simple sentiment classification model

• Take your words as one-hot vectors and multiply them by the embedding matrix to extract each word’s embedding vector.
• Average the embedding vectors, then take a softmax classifier’s output as your prediction.
• Con: this ignores word order.
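The averaging model above can be sketched in numpy. All parameters here are random stand-ins (a real model would use a pretrained $E$ and learned softmax weights), and the word indices are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
vocab_size, dim, n_classes = 10000, 300, 5   # e.g. 1-5 star ratings

E = rng.standard_normal((dim, vocab_size)) * 0.01   # stand-in embedding matrix
W = rng.standard_normal((n_classes, dim)) * 0.01    # softmax weights
b = np.zeros(n_classes)

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def predict(word_indices):
    """Average the words' embeddings, then apply a softmax classifier."""
    avg = E[:, word_indices].mean(axis=1)   # (300,) -- word order is lost here
    return softmax(W @ avg + b)

probs = predict([6257, 4834, 21])   # hypothetical indices for a short review
print(probs.shape)   # (5,)
```

Because `mean` discards order, “completely lacking in good service” averages to nearly the same vector as a positive review containing “good”: that is the con noted above.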

RNN for sentiment classification

• A many-to-one RNN that reads in your entire sequence and outputs a single softmax prediction.
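A minimal numpy sketch of the many-to-one architecture, assuming a basic tanh RNN cell (the dimensions and random parameters are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
dim, hidden, n_classes = 300, 64, 5

Waa = rng.standard_normal((hidden, hidden)) * 0.01
Wax = rng.standard_normal((hidden, dim)) * 0.01
ba  = np.zeros(hidden)
Wya = rng.standard_normal((n_classes, hidden)) * 0.01
by  = np.zeros(n_classes)

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def many_to_one(embeddings):
    """Run the RNN over the whole sequence; emit one softmax at the last step."""
    a = np.zeros(hidden)
    for e_t in embeddings:                      # one word embedding per time step
        a = np.tanh(Waa @ a + Wax @ e_t + ba)
    return softmax(Wya @ a + by)                # single output after the sequence

x = rng.standard_normal((7, dim))   # a 7-word sentence's embeddings
y_hat = many_to_one(x)
print(y_hat.shape)   # (5,)
```

Unlike the averaging model, the hidden state is updated word by word, so word order influences the prediction.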

### Debiasing word embeddings

How to diminish/eliminate bias (gender, race, etc.) in word embeddings.

• Man:Woman as King:Queen
• Man:Computer_Programmer as Woman:Homemaker (probably not right)
• Man:Computer_Programmer as Woman:Computer Programmer
• Father:Doctor as Mother:Nurse (although Doctor would have been better)

The biases picked up reflect the biases in text written by people. They are difficult to scrub out when you train on a lot of historical data.

1. Identify the bias direction.
2. Neutralize: for every word that is not definitional, project out the bias direction.
3. Equalize pairs.

• The authors trained a classifier to determine which words are definitional and which are not. This helps decide which words to neutralize (project out the bias direction).
• The number of pairs to equalize is usually very small.
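The neutralize step can be sketched with a projection. This is a simplification with toy vectors: the bias direction here is a single difference $e_{\text{woman}} - e_{\text{man}}$, whereas the actual method averages several such differences:

```python
import numpy as np

# Toy embeddings; "doctor" is given an unwanted lean along the gender axis.
E = {
    "man":    np.array([-1.0, 0.0, 0.1]),
    "woman":  np.array([ 1.0, 0.1, 0.1]),
    "doctor": np.array([-0.4, 0.8, 0.3]),
}

# Step 1: identify the bias direction (simplified to one pair difference).
g = E["woman"] - E["man"]

# Step 2: neutralize -- subtract a non-definitional word's projection onto g.
def neutralize(e, g):
    return e - (e @ g) / (g @ g) * g

e_doctor = neutralize(E["doctor"], g)
print(e_doctor @ g)   # ~0: "doctor" no longer leans along the gender direction
```

Step 3 (equalize) would then move pairs like man/woman so they are equidistant from every neutralized word.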