Taking the Coursera Deep Learning Specialization, Sequence Models course. Will post condensed notes every week as part of the review process. All material originates from the free Coursera course, taught by Andrew Ng. See deeplearning.ai for more details.

Table of Contents

Natural Language Processing & Word Embeddings

Introduction to Word Embeddings

Word Representation


It would be better if each word could be represented by features.

For instance, a word could have a gender associated with it.

The word man could have gender -1 and the word woman could have gender +1 while a word like apple could have gender 0.


Notation is $e_{5391}$, where the subscript is the index from the original one-hot vector, but $e$ refers to the feature (embedding) vector instead of the one-hot vector.

It’s common to visualize word embeddings in a 2D plane using an algorithm like t-SNE. These are called embeddings because each word is mapped to a point in a high-dimensional space; t-SNE lets you visualize that space in a lower-dimensional (2D) view.
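As a rough sketch of the visualization step (using random vectors as stand-ins for real learned embeddings), scikit-learn's t-SNE can project 300-dimensional vectors down to 2D:

```python
# Minimal sketch: project a few (placeholder) 300-d word vectors to 2D with t-SNE.
import numpy as np
from sklearn.manifold import TSNE

words = ["man", "woman", "king", "queen", "apple", "orange"]
E = np.random.randn(len(words), 300)   # stand-in for learned embeddings

# perplexity must be smaller than the number of points
points_2d = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(E)
for w, (x, y) in zip(words, points_2d):
    print(f"{w:>7s}: ({x:7.2f}, {y:7.2f})")
```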


Using word embeddings

Named entity recognition: trying to detect people’s names in a sentence.


Word embeddings can be trained on very large corpora, 1B to 100B words (this is common), while the labeled training set for your task might only be around 100K words.

This knowledge can be transferred to named entity recognition, as you can train your neural network’s word embeddings on text found on the internet.

  1. Learn word embeddings from a large text corpus (1-100B words), or download pre-trained embeddings online.
  2. Transfer the embeddings to a new task with a smaller training set (say, 100k words). Rather than using a 10,000-dimensional one-hot vector, you can now use a 300-dimensional dense vector (see the sketch after this list).
  3. Optional: continue to fine-tune the word embeddings with new data.
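A minimal sketch of steps 2-3, assuming you already have a pre-trained 300-dimensional embedding matrix (random numbers stand in for it here, and the small sentence-level classifier on top is just illustrative):

```python
# Sketch: reuse (hypothetical) pre-trained 300-d embeddings for a small labeled task
# instead of 10,000-d one-hot vectors.
import numpy as np
import tensorflow as tf

vocab_size, emb_dim = 10_000, 300
pretrained_E = np.random.randn(vocab_size, emb_dim)   # stand-in for downloaded embeddings

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, emb_dim,
        embeddings_initializer=tf.keras.initializers.Constant(pretrained_E),
        trainable=False,   # step 3 (optional): set to True to fine-tune on your data
    ),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```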

Finally, word embeddings have an interesting relationship to face encodings. For face recognition, recall the Siamese network trained to generate an encoding for the input image’s face. The image encoding plays a similar role to a word embedding; the difference is that face encodings are computed for arbitrary new images, while word embeddings are learned for a fixed-size vocabulary (words outside it map to <UNK>).


Properties of word embeddings

Suppose that you are given the question “Man is to woman, as King is to ?”

Is it possible to have the neural network answer this question using word embeddings? Yes. One interesting property of word embeddings is that you can take the difference $e_{\text{man}} - e_{\text{woman}}$ and compare it to the difference $e_{\text{king}} - e_{\text{queen}}$.


Your algorithm can first compute the difference $e_{\text{man}} - e_{\text{woman}}$, then search for a word whose difference from $e_{\text{king}}$ is similar, completing the analogy king to ?.

In pictures: suppose the word embeddings live in a 300-dimensional space. The vector difference between Man and Woman is very similar to the vector difference between King and Queen; that difference vector (the arrow in the slide) roughly represents a difference in gender.

Try to find the word $w$ such that the following holds: $$ e_{\text{man}} - e_{\text{woman}} \approx e_{\text{king}} - e_w $$

Find the word $w$: $\arg\max\limits_w \text{similarity}(e_w, e_{\text{king}} - e_{\text{man}} + e_{\text{woman}})$

If you learn a set of word embeddings, you can find analogies using word vectors with decent accuracy.


The most commonly used similarity function is Cosine similarity.

$$\text{similarity}(u, v) = \dfrac{u^Tv}{||u||_2||v||_2} $$

You can also use squared Euclidean distance, $||u-v||^2$, though it measures dissimilarity rather than similarity.
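A minimal numpy sketch of this analogy search, assuming a hypothetical dictionary `E` that maps each word to its embedding vector:

```python
# Sketch: find w maximizing similarity(e_w, e_king - e_man + e_woman).
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(a, b, c, E):
    """Return the word w maximizing similarity(e_w, e_c - e_a + e_b)."""
    target = E[c] - E[a] + E[b]
    best_w, best_sim = None, -np.inf
    for w, e_w in E.items():
        if w in (a, b, c):                 # skip the input words themselves
            continue
        sim = cosine_similarity(e_w, target)
        if sim > best_sim:
            best_w, best_sim = w, sim
    return best_w

# e.g. complete_analogy("man", "woman", "king", E) should return "queen"
```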

Things it can learn: gender (man:woman as king:queen), capital cities (Ottawa:Canada as Nairobi:Kenya), comparatives (big:bigger as tall:taller), and currencies (Yen:Japan as Ruble:Russia).

Embedding matrix

When you implement an algorithm to learn an embedding, you’re learning an embedding matrix.

Take, for instance, a 10,000-word vocabulary: [a, aaron, orange, ... zulu, <UNK>]

Let’s make this matrix $E$ be 300 by 10,000. If orange is indexed at 6257:

$O_{6257}$ is the one-hot vector with 10,000 rows and a 1 at the 6257th position.


$$ E \, O_j = e_j $$

Initialize $E$ randomly, then use gradient descent to learn the parameters of the embedding matrix.

In practice, you use a specialized lookup function instead of this multiplication, since matrix multiplication with many one-hot vectors is inefficient. Keras has an Embedding layer that does this for you.
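As a sanity check of the $E \, O_j = e_j$ identity, here is a small numpy sketch (with a random $E$) showing that the matrix multiplication is just a column lookup:

```python
# Multiplying E by a one-hot vector just selects a column of E.
import numpy as np

vocab_size, emb_dim = 10_000, 300
E = np.random.randn(emb_dim, vocab_size)   # 300 x 10,000, initialized randomly

j = 6257                                   # index of "orange" in this example
O_j = np.zeros(vocab_size)
O_j[j] = 1.0

e_via_matmul = E @ O_j                     # slow: 300 * 10,000 multiply-adds
e_via_lookup = E[:, j]                     # fast: direct column lookup

assert np.allclose(e_via_matmul, e_via_lookup)
```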

Learning Word Embeddings: Word2vec & GloVe

Learning word embeddings

Let’s say you’re building a language model. Building a neural language model is a reasonable way to learn word embeddings.


You can have various context and target pairs.



Now consider the skip-gram model. Let’s say you’re given the sentence “I want a glass of orange juice to go along with my cereal”. Rather than having the context be the immediately preceding word, randomly pick a word to be your context word. Then randomly pick a word within a window around it (plus or minus four words, say) to be your target word.




Softmax : $$ p(t|c) = \dfrac{e^{\theta_t^Te_c}}{\sum\limits_{j=1}^{10,000}e^{\theta_j^Te_c}} $$

$$ \theta_t = \text{parameter associated with output t} $$

Loss function: $$ \mathcal{L}(\hat{y}, y) = - \sum\limits_{i=1}^{10,000} y_i \log \hat{y}_i $$
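Putting the pieces above together, here is a numpy sketch of the skip-gram softmax and its loss for a single (context, target) pair; the indices and the random parameters are made up for illustration:

```python
# Compute p(t|c) with a full softmax over the vocabulary, plus the cross-entropy loss.
import numpy as np

vocab_size, emb_dim = 10_000, 300
E = np.random.randn(emb_dim, vocab_size) * 0.01       # embedding matrix (columns e_j)
Theta = np.random.randn(emb_dim, vocab_size) * 0.01   # output parameters (columns theta_t)

c, t = 6257, 4834           # hypothetical indices for context "orange", target "juice"
e_c = E[:, c]

logits = Theta.T @ e_c                    # theta_j^T e_c for every word j
p = np.exp(logits - logits.max())         # subtract max for numerical stability
p /= p.sum()                              # softmax over all 10,000 words (the slow part)

loss = -np.log(p[t])                      # cross-entropy with a one-hot target
```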

Problems with softmax classification: the denominator requires summing over the entire vocabulary (10,000 words here, often far more), so every training step is slow to compute.


In practice, the context $c$ is not sampled uniformly at random; uniform sampling over the corpus gives you mostly common words like “and”, “or”, “to”, etc. Instead, heuristics are used so the chosen words are more likely to result in a better embedding matrix.

Negative Sampling

The downside of the skip-gram model above is that the softmax step is slow to compute. Negative sampling is much more efficient.

Define a new learning problem

Given a pair of words, such as orange:juice, determine whether it is a context-target pair: orange:juice returns 1, while orange:king returns 0.


Pick a valid context/target pair and label it 1, then pick $k$ random words from the dictionary and label those pairs 0 (randomly chosen words are usually not related to the context).

Define a logistic regression model.

$$ P(y=1 | c, t) = \sigma (\theta_t^Te_c) $$


On each iteration you only update $k+1$ binary classification problems rather than a 10,000-way softmax. This is called negative sampling because you have one positive example, and then you go out and generate a bunch of negative examples.

How do you choose the negative examples after choosing the context word “orange”? One thing you can do is sample candidate target words according to their empirical frequency in your corpus (how often each word appears). The problem is that this gives you a lot of words like “the”, “of”, “and”, etc.

Empirically, what the authors found to work best:

$$ p(w_i) = \dfrac{f(w_i)^{3 / 4}}{\sum\limits_{j=1}^{10000} f(w_j)^{3 / 4}} $$
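Here is a sketch of negative sampling for one positive pair plus $k$ sampled negatives, including the $f(w)^{3/4}$ sampling distribution; the corpus counts, embeddings, and value of $k$ are all made up:

```python
# k + 1 binary logistic losses instead of one 10,000-way softmax.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sampling distribution p(w_i) proportional to f(w_i)^(3/4)
freqs = np.array([9000.0, 120.0, 30.0, 5.0])   # hypothetical corpus word counts
p_neg = freqs ** 0.75
p_neg /= p_neg.sum()
# In practice the k negative words are drawn from the vocabulary with probabilities p_neg.

rng = np.random.default_rng(0)
k, emb_dim = 4, 300
e_c = rng.standard_normal(emb_dim) * 0.01               # embedding of the context word
theta_pos = rng.standard_normal(emb_dim) * 0.01         # parameters of the true target
theta_negs = rng.standard_normal((k, emb_dim)) * 0.01   # parameters of k sampled negatives

loss = -np.log(sigmoid(theta_pos @ e_c))                 # positive pair, label 1
loss += -np.log(1.0 - sigmoid(theta_negs @ e_c)).sum()   # negative pairs, label 0
```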


GloVe word vectors

GloVe stands for “global vectors for word representation”.

“I want a glass of orange juice to go along with my cereal.”

$$ X_{ij} = \text{number of times word } i \text{ appears in the context of word } j $$




Minimize: $$ \sum\limits_{i=1}^{10,000} \sum\limits_{j=1}^{10,000} f(X_{ij})(\theta_i^T e_j + b_i + b_j' - \log{X_{ij}})^2 $$

where $f(X_{ij}) = 0$ if $X_{ij} = 0$:

- $f$ handles the convention $0\log{0} = 0$.
- $f$ also balances the weight given to frequent words (this, is, of, a) and infrequent words (durian).
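A numpy sketch of this objective on a tiny made-up co-occurrence matrix; the weighting function below uses the capped form $(\min(X_{ij}/x_{\max}, 1))^{3/4}$ from the GloVe paper, which is one possible choice for $f$:

```python
# Evaluate the GloVe weighted least-squares objective (no training loop here).
import numpy as np

V, d = 5, 10                                        # toy vocabulary size and embedding dim
rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(V, V)).astype(float)  # made-up co-occurrence counts
Theta = rng.standard_normal((V, d)) * 0.01
E = rng.standard_normal((V, d)) * 0.01
b = np.zeros(V)
b_prime = np.zeros(V)

def f(x, x_max=100.0, alpha=0.75):
    """Weighting term: 0 when x == 0, down-weights very frequent pairs."""
    return np.where(x > 0, np.minimum(x / x_max, 1.0) ** alpha, 0.0)

# log(X) is only needed where X > 0; the f(X) factor zeroes out the rest.
log_X = np.log(np.where(X > 0, X, 1.0))
diff = Theta @ E.T + b[:, None] + b_prime[None, :] - log_X   # theta_i^T e_j + b_i + b_j' - log X_ij
objective = np.sum(f(X) * diff ** 2)
print(objective)
```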

Note on the featurization view of word embeddings: you can’t guarantee that the individual axes of a learned embedding correspond to interpretable features like gender or age; the algorithm may learn an arbitrary linear transformation of such features. The analogy relationships still hold regardless.


Applications using Word Embeddings

Sentiment Classification

The task of looking at a piece of text and telling whether the writer liked or disliked the thing they’re talking about.


Simple sentiment classification model: take the embedding of each word in the review, average (or sum) them, and feed the result into a softmax classifier that outputs, say, a star rating. Because this ignores word order, a review like “Completely lacking in good taste, good service, and good ambience” can fool it (the word “good” appears repeatedly).


RNN for sentiment classification: feed the word embeddings into a many-to-one RNN and pass the final hidden state to a softmax classifier. This takes word order into account, so it handles examples like the one above correctly.
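A minimal Keras sketch of the RNN sentiment classifier just described; the layer sizes and the 5-star output are illustrative, and in practice the Embedding layer would be initialized from pre-trained word embeddings:

```python
# Many-to-one RNN: embeddings -> LSTM final state -> softmax over star ratings.
import tensorflow as tf

vocab_size, emb_dim = 10_000, 300
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, emb_dim),
    tf.keras.layers.LSTM(128),                      # keeps only the final hidden state
    tf.keras.layers.Dense(5, activation="softmax"), # 1-5 star rating
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```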


Debiasing word embeddings

How to diminish/eliminate bias (gender, race, etc.) in word embeddings.


The biases picked up reflect the biases in text written by people. They are difficult to scrub out when you train on a large amount of historical data.


  1. Identify the bias direction (e.g. by averaging differences such as $e_{\text{he}} - e_{\text{she}}$ and $e_{\text{male}} - e_{\text{female}}$).
  2. Neutralize: For every word that is not definitional, project to get rid of bias.
  3. Equalize pairs.
  4. Authors trained a classifier to determine which words were definitional and which words were not definitional. This helped detect which words to neutralize (to project out bias direction).
  5. Number of pairs to equalize is usually very small.
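A numpy sketch of step 2 (neutralize), assuming you already have a bias direction $g$; the vectors here are random placeholders:

```python
# Remove the component of a word vector that lies along the bias direction g.
import numpy as np

def neutralize(e_w, g):
    """Project e_w onto g and subtract that component, leaving no bias-direction part."""
    e_bias = (np.dot(e_w, g) / np.dot(g, g)) * g
    return e_w - e_bias

rng = np.random.default_rng(0)
g = rng.standard_normal(300)              # bias direction, e.g. averaged e_he - e_she differences
e_babysitter = rng.standard_normal(300)   # placeholder embedding for a non-definitional word

e_debiased = neutralize(e_babysitter, g)
assert abs(np.dot(e_debiased, g)) < 1e-8  # no remaining component along g
```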