Taking the Coursera Machine Learning course. Will post condensed notes every week as part of the review process. All material originates from the free Coursera course, taught by Andrew Ng.

Assumes you have knowledge of Week 7.


Unsupervised Learning



Unsupervised learning is the class of problems where, given a dataset with no labels, the goal is to find structure in the data.


Clustering is good for problems like market segmentation, social network analysis, organizing computer clusters, and astronomical data analysis.

K-Means Algorithm

This is a clustering algorithm.

  1. Randomly initialize your cluster centroids.
  2. Cluster assignment step: Assign each example to a cluster centroid based on distance.
  3. Move centroid step: Take the centroids, move them to the average of all the assigned examples.
  4. Iterate steps 2 and 3 until convergence.

(Figures k_means_1 through k_means_5: K-Means converging over five iterations of cluster assignment and centroid moves.)

Formal Definition


$ x^{(i)} \in \mathbb{R}^n $ (drop $x_0 = 1$ convention)

Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}^{n}$

Repeat {

$$ \text{for } i = 1 \text{ to } m: \quad c^{(i)} := \text{index (from } 1 \text{ to } K\text{) of the cluster centroid closest to } x^{(i)} $$ $$ \text{for } k = 1 \text{ to } K: \quad \mu_k := \text{mean of the examples assigned to cluster } k $$

}

If a cluster ends up with no points assigned to it, it is common practice to remove that cluster. Alternatively, one may reinitialize the algorithm with new cluster centroids.
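The loop above can be sketched in NumPy rather than the course's Octave (a minimal, hypothetical implementation; the name `kmeans` and the empty-cluster handling, keeping the old centroid instead of removing the cluster, are my own choices):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Minimal K-Means sketch: X is (m, n); returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # randomly pick K distinct training examples as the initial centroids
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(max_iters):
        # cluster assignment step: index of the closest centroid per example
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (m, K)
        c = dists.argmin(axis=1)
        # move centroid step: each centroid moves to the mean of its points;
        # an empty cluster keeps its old centroid here (the notes suggest
        # removing it or reinitializing instead)
        new_mu = np.array([X[c == k].mean(axis=0) if np.any(c == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # converged
            break
        mu = new_mu
    return mu, c
```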

Optimization Objective

The K-Means cost function (optimization objective) is defined here:

$$ c^{(i)} = \text{ index of cluster (1, 2,}\dots, K\text{) to which example } x^{(i)} \text{ is currently assigned} $$ $$ \mu_k = \text{ cluster centroid } k \hspace{1em} (\mu_k \in \mathbb{R}^{n}) $$ $$ \mu_{c^{(i)}} = \text{ cluster centroid of cluster to which example } x^{(i)} \text{ has been assigned} $$

$$ J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K) = \dfrac{1}{m} \sum\limits_{i=1}^{m} || x^{(i)} - \mu_{c^{(i)}} || ^2 $$ $$ \min\limits_{c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K} J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K) $$
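Given the assignments and centroids, the distortion $J$ is a one-liner in NumPy (a sketch; the function name `distortion` is mine):

```python
import numpy as np

def distortion(X, c, mu):
    # J = (1/m) * sum_i || x^(i) - mu_{c^(i)} ||^2
    return float(np.mean(np.sum((X - mu[c]) ** 2, axis=1)))
```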

Random Initialization

How do you initialize the cluster centroids? A good method: make sure $K < m$, then randomly pick $K$ distinct training examples and set $\mu_1, \dots, \mu_K$ equal to those examples.


The K-Means algorithm may end up in local optima. The way around this is to run K-Means multiple times with different random initializations (typically 50 to 1000 runs), compute the cost function $J$ for each result, and pick the clustering with the lowest cost.
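Picking $K$ distinct training examples as the starting centroids can be sketched in NumPy (the function name `init_centroids` is mine):

```python
import numpy as np

def init_centroids(X, K, seed=None):
    # require K < m, then pick K distinct training examples as centroids
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=K, replace=False)
    return X[idx].copy()
```

Running K-Means from, say, 100 such initializations and keeping the clustering with the lowest cost $J$ is then a simple loop over seeds.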

Choosing the Number of Clusters

Choosing the number of clusters $K$ in K-Means is a non-trivial problem, as clusters may or may not be intuitive. Usually it is still a manual step where a person picks the number of clusters by looking at a plot.


Elbow Method

Plot the cost function $J$ against the number of clusters $K$. The cost always decreases as $K$ grows, but the "elbow", the point where the rate of decrease drops off sharply, is a reasonable choice for $K$. In practice the curve is often smooth with no clear elbow, so the method is not always conclusive.

Usually the choice of $K$ is driven by a downstream purpose. For example, when computing clusters for market segmentation while selling T-shirts, it may be more useful to pick a pre-defined number of clusters, such as "Small, Medium, Large" ($K = 3$) or "Extra Small, Small, Medium, Large, Extra Large" ($K = 5$) sizes.

Dimensionality Reduction


Data Compression

There are two primary reasons to perform dimensionality reduction. One of them is data compression, and the other is that dimensionality reduction can increase performance of our learning algorithms.

Given two features, like length in inches and length in centimeters, with slight round-off error, there is a lot of redundancy. It would be useful to reduce the 2D data to a 1D representation.


Before, we needed two numbers to represent an example. After compression, only one number is necessary to represent the example.

A typical example of dimensionality reduction is compressing data from 1000 dimensions down to 100 dimensions.



Dimensionality Reduction also helps us visualize the data better. Suppose we have the following dataset:


We want to reduce the features to a two- or three-dimensional vector in order to better understand the data, rather than attempt to plot a 50-dimensional table.


Principal Component Analysis

Principal Component Analysis Problem Formulation

PCA is the most popular algorithm for dimensionality reduction. It tries to find a lower-dimensional surface onto which to project the data so as to minimize the sum of squared projection errors (the distances between the points and their projections).


PCA is not linear regression. Linear regression minimizes the vertical distances between the points and the fitted line (the error in predicting $y$), whereas PCA minimizes the orthogonal projection distances and treats all features equally; there is no special variable $y$ to predict.


Principal Component Analysis Algorithm

  1. Data preprocessing step:

    • Given your training set $x^{(1)}, x^{(2)}, \dots, x^{(m)}$
    • Perform feature scaling and mean normalization $$ \mu_j = \dfrac{1}{m} \sum\limits_{i=1}^m x_j^{(i)} $$
    • Replace each $x_j^{(i)}$ with $ x_j^{(i)} - \mu_j $.
    • If different features are on different scales (e.g. size of house, number of bedrooms), scale the features to have a comparable range of values. $$ x_j^{(i)} \leftarrow \dfrac{x_j^{(i)} - \mu_j}{s_j} $$
  2. PCA algorithm

    • Reduce data from $n$-dimensions to $k$-dimensions
    • Compute the “covariance matrix”:
    • (it is unfortunate that the symbol $\Sigma$ is used, do not confuse it with summation) $$ \Sigma = \dfrac{1}{m} \sum\limits_{i=1}^{m} (x^{(i)})(x^{(i)})^T $$
    • Compute the ‘eigenvectors’ of matrix $\Sigma$
  [ U, S, V ] = svd(Sigma)
  % svd stands for singular value decomposition
  % another function that does this is the eig(Sigma) function
  % svd(Sigma) returns an n * n matrix
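The preprocessing and projection steps above can be sketched in NumPy rather than the course's Octave (a minimal sketch; the function name `pca_project` is mine, feature scaling is omitted for brevity, and `np.linalg.svd` plays the role of Octave's `svd`):

```python
import numpy as np

def pca_project(X, k):
    """Reduce X (m x n) to k dimensions via PCA. Feature scaling omitted."""
    m = X.shape[0]
    mu = X.mean(axis=0)
    X_norm = X - mu                   # mean normalization
    Sigma = X_norm.T @ X_norm / m     # n x n covariance matrix
    U, S, Vt = np.linalg.svd(Sigma)   # columns of U are the eigenvectors
    U_reduce = U[:, :k]               # first k columns of U (n x k)
    Z = X_norm @ U_reduce             # projected data (m x k)
    return Z, U_reduce, mu, S
```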



Applying PCA

Reconstruction from Compressed Representation

After compression, how do we go back to an approximation of the higher-dimensional state? Multiply by $U_{\text{reduce}}$ again: $$ x_{\text{approx}} = U_{\text{reduce}} \, z \approx x $$

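Reconstruction multiplies the compressed representation back by $U_{\text{reduce}}$ (adding back the mean if it was subtracted). A self-contained NumPy toy example, 2D to 1D and back (all variable names here are mine):

```python
import numpy as np

# toy data lying almost on a line, so one component captures nearly everything
X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]])
mu = X.mean(axis=0)
X_norm = X - mu
Sigma = X_norm.T @ X_norm / len(X)
U, S, Vt = np.linalg.svd(Sigma)

U_reduce = U[:, :1]             # keep k = 1 component
Z = X_norm @ U_reduce           # compress: (m, 1)
X_approx = Z @ U_reduce.T + mu  # reconstruct: back to (m, 2)
```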

Choosing the Number of Principal Components

In the PCA algorithm, how do we choose the value for $k$? How do we choose the number of principal components?

Recall that PCA tries to minimize the average squared projection error

$$ \dfrac{1}{m} \sum_{i=1}^{m} || x^{(i)} - x_{\text{approx}}^{(i)} ||^2 $$

Also, the total variation in the data is defined as

$$ \dfrac{1}{m} \sum_{i=1}^{m} || x^{(i)} || ^2 $$

Typically choose $k$ to be the smallest value such that

$$ \dfrac{ \dfrac{1}{m} \sum_{i=1}^{m} || x^{(i)} - x_{\text{approx}}^{(i)} || ^ 2 }{ \dfrac{1}{m} \sum_{i=1}^{m} || x^{(i)} || ^ 2 } \leq 0.01 $$ $$ \text{This means that 99% of variance is retained.} $$

Vary the percentage between 95-99 percent, depending on your application and requirements.

This ratio can be computed cheaply from the diagonal matrix $S$ returned by svd(Sigma): choose the smallest $k$ such that $$ \dfrac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \geq 0.99 $$ so that svd needs to be called only once while searching over $k$.


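Given the diagonal entries of $S$ from svd(Sigma), the smallest $k$ retaining the desired variance can be found with one cumulative sum, so the decomposition runs only once. A NumPy sketch (`choose_k` is my name):

```python
import numpy as np

def choose_k(s, retain=0.99):
    # s: diagonal entries of S, in decreasing order; find the smallest k
    # with sum(s[:k]) / sum(s) >= retain (i.e. `retain` variance kept)
    ratios = np.cumsum(s) / np.sum(s)
    return int(np.argmax(ratios >= retain)) + 1
```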
Advice for Applying PCA

PCA can be applied to reduce your training set dimensions before feeding the resulting training set to a learning algorithm.

Fit the PCA mapping (the mean $\mu_j$ and the matrix $U_{\text{reduce}}$) using only the training set; then apply that same mapping to the cross-validation and test sets, rather than re-running PCA on them.
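Concretely, the mapping should be learned from the training set and then reused on other splits; a NumPy sketch with hypothetical names `fit_pca` / `apply_pca`:

```python
import numpy as np

def fit_pca(X_train, k):
    # learn the mapping (mean + U_reduce) on the training set only
    mu = X_train.mean(axis=0)
    Xn = X_train - mu
    Sigma = Xn.T @ Xn / len(X_train)
    U, S, Vt = np.linalg.svd(Sigma)
    return mu, U[:, :k]

def apply_pca(X, mu, U_reduce):
    # reuse the training-set mapping on cross-validation / test data
    return (X - mu) @ U_reduce
```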

One bad use of PCA is to prevent overfitting. Use regularization instead. PCA throws away some information without knowing what the corresponding values of y are.

Do not run PCA unnecessarily. First try running your learning algorithm on the raw data $x^{(i)}$; only implement PCA if that fails, for example because training is too slow or uses too much memory.

Move on to Week 9.