Taking the Coursera Machine Learning course. Will post condensed notes every week as part of the review process. All material originates from the free Coursera course, taught by Andrew Ng.

Assumes you have knowledge of Week 7.


# Unsupervised Learning

## Clustering

### Introduction

Unsupervised learning is the class of problems where, given a dataset with no labels, the goal is to find structure in the data.

Clustering is good for problems like:

• Market segmentation (create groups of potential customers)
• Social network analysis (analyze groups of friends)
• Organizing computing clusters (arrange servers in ideal locations relative to one another)
• Astronomical data analysis (understand groupings of stars)

### K-Means Algorithm

This is a clustering algorithm.

1. Randomly initialize your cluster centroids.
2. Cluster assignment step: Assign each example to a cluster centroid based on distance.
3. Move centroid step: Take the centroids, move them to the average of all the assigned examples.
4. Iterate through step 2 and 3 until convergence.

Formal Definition

Input:

• K (number of clusters)
• Training set $\{ x^{(1)}, x^{(2)}, \dots, x^{(m)} \}$

$x^{(i)} \in \mathbb{R}^n$ (drop $x_0 = 1$ convention)

Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}^{n}$

Repeat {

• Cluster Assignment Step
• for $i = 1$ to $m$, $c^{(i)} :=$ index (from $1$ to $K$) of cluster centroid closest to $x^{(i)}$
• Move Centroid Step
• for $k = 1$ to $K$, $\mu_k :=$ average (mean) of points assigned to cluster $k$

}

If a cluster ends up with no points assigned to it, it is common practice to remove that cluster. Alternatively, one may reinitialize the algorithm with new cluster centroids.
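The steps above can be sketched in code. The course implements K-Means in Octave; the following is an illustrative NumPy version (the function and variable names are my own):

```python
import numpy as np

def k_means(X, K, n_iters=10, seed=0):
    """K-Means on data X (m x n). Returns centroids mu (K x n) and assignments c (m,)."""
    rng = np.random.default_rng(seed)
    # Randomly initialize the K cluster centroids from K distinct training examples.
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Cluster assignment step: c^(i) = index of the centroid closest to x^(i).
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (m, K)
        c = dists.argmin(axis=1)
        # Move centroid step: mu_k = mean of the points assigned to cluster k.
        for k in range(K):
            members = X[c == k]
            if len(members) > 0:
                mu[k] = members.mean(axis=0)
            # An empty cluster is left where it is; dropping it or
            # re-initializing are the other common options.
    return mu, c
```
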

### Optimization Objective

The K-Means cost function (optimization objective) is defined here:

$$c^{(i)} = \text{ index of cluster (1, 2,}\dots, K\text{) to which example } x^{(i)} \text{ is currently assigned}$$ $$\mu_k = \text{ cluster centroid } k \hspace{1em} (\mu_k \in \mathbb{R}^{n})$$ $$\mu_{c^{(i)}} = \text{ cluster centroid of cluster to which example } x^{(i)} \text{ has been assigned}$$

$$J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K) = \dfrac{1}{m} \sum\limits_{i=1}^{m} || x^{(i)} - \mu_{c^{(i)}} || ^2$$ $$\min\limits_{c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K} J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K)$$
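As a concrete statement of this cost function (an illustrative NumPy sketch; names are my own):

```python
import numpy as np

def distortion(X, c, mu):
    """J(c, mu): average squared distance between each example x^(i)
    and the centroid mu_{c^(i)} of the cluster it is assigned to."""
    return float(np.mean(np.sum((X - mu[c]) ** 2, axis=1)))
```

The cluster assignment step minimizes $J$ with the centroids held fixed, and the move centroid step minimizes $J$ with the assignments held fixed, so the cost never increases from one iteration to the next.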

### Random Initialization

How do you initialize the cluster centroids?

• Pick a number of clusters less than the number of examples you have
• should have $K \lt m$
• Randomly pick $K$ training examples
• Set $\mu_1, \dots, \mu_K$ equal to these $K$ examples

The K-Means algorithm may end up at a local optimum. The way to get around this is to run K-Means multiple times (typically 50 to 1000) with different random initializations. Compute the cost function J for each result and pick the clustering that gives the lowest cost.
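A self-contained sketch of this multiple-random-initialization procedure (illustrative NumPy, not the course's Octave code):

```python
import numpy as np

def run_k_means(X, K, rng, n_iters=10):
    """A single K-Means run initialized from K randomly chosen examples."""
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Assign each example to the closest centroid.
        c = np.linalg.norm(X[:, None] - mu[None], axis=2).argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        mu = np.array([X[c == k].mean(axis=0) if np.any(c == k) else mu[k]
                       for k in range(K)])
    J = np.mean(np.sum((X - mu[c]) ** 2, axis=1))  # distortion of this run
    return mu, c, J

def best_of_n_inits(X, K, n_init=50, seed=0):
    """Run K-Means n_init times; keep the clustering with the lowest cost J."""
    rng = np.random.default_rng(seed)
    return min((run_k_means(X, K, rng) for _ in range(n_init)),
               key=lambda result: result[2])
```
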

### Choosing the Number of Clusters

Choosing the number of clusters $K$ in K-Means is a non-trivial problem, as clusters may or may not be visually intuitive. It is usually still a manual step where a person picks the number of clusters by looking at a plot.

• Are there 2, 3, or 4 clusters? It is often ambiguous.

Elbow Method

• Run K-Means with varying number of clusters. (1 cluster, then 2 clusters, then 3… so on and so on)
• Ends up with a curve showing how distortion decreases as the number of clusters increases

• usually the ‘elbow’ is not clearly defined

Often the choice of $K$ is driven by the downstream purpose of the clustering. For example, when computing clusters for market segmentation of T-shirt sales, it may be more useful to pick a pre-defined number of clusters, such as three sizes (Small, Medium, Large) or five sizes (Extra Small, Small, Medium, Large, Extra Large).

# Dimensionality Reduction

## Motivation

### Data Compression

There are two primary reasons to perform dimensionality reduction. One of them is data compression, and the other is that dimensionality reduction can increase performance of our learning algorithms.

Given two features, such as length in inches and in centimeters, that differ only by slight round-off error, there is a lot of redundancy. It would be useful to reduce the 2D data to a 1D representation.

Before, we needed two numbers to represent an example. After compression, only one number is necessary to represent the example.

The typical example of dimensionality reduction is from 1000D to 100D.

### Visualization

Dimensionality reduction also helps us visualize data better. Suppose we have a dataset with many features per example (in the course example, 50 features describing each country).

We want to reduce the features to a two- or three-dimensional vector in order to plot and better understand the data, rather than attempt to interpret a 50-dimensional table.

## Principal Component Analysis

### Principal Component Analysis Problem Formulation

PCA is the most popular algorithm to perform dimensionality reduction.

• Find a lower-dimensional surface onto which to project the data such that the sum of squared projection errors is minimized.
• Standard practice is to perform feature scaling and mean normalization before applying PCA.

PCA is not linear regression

• In linear regression, we minimize the squared vertical distance between each point and the value predicted by the hypothesis
• In PCA, we minimize the orthogonal (projection) distance between each point and the lower-dimensional surface

### Principal Component Analysis Algorithm

1. Data preprocessing step:

• Given your training set $x^{(1)}, x^{(2)}, \dots, x^{(m)}$
• Perform feature scaling and mean normalization $$\mu_j = \dfrac{1}{m} \sum\limits_{i=1}^m x_j^{(i)}$$
• Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$.
• If different features on different scales, (e.g. size of house, number of bedrooms) scale features to have comparable range of values. $$x_j^{(i)} \leftarrow \dfrac{x_j^{(i)} - \mu_j}{s_j}$$
2. PCA algorithm

• Reduce data from $n$-dimensions to $k$-dimensions
• Compute the “covariance matrix”:
• (it is unfortunate that $\Sigma$ is used here; do not confuse it with a summation sign) $$\Sigma = \dfrac{1}{m} \sum\limits_{i=1}^{m} (x^{(i)})(x^{(i)})^T$$
• Compute the ‘eigenvectors’ of matrix $\Sigma$
    [U, S, V] = svd(Sigma);
    % svd stands for singular value decomposition
    % eig(Sigma) also computes the eigenvectors, but svd is more numerically stable
    % svd(Sigma) returns U as an n x n matrix whose columns are the eigenvectors of Sigma
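Putting the preprocessing and PCA steps together, an equivalent sketch in NumPy (illustrative; the course works in Octave, and the final projection step $z = U_{\text{reduce}}^T x$ is included here for completeness):

```python
import numpy as np

def pca(X, k):
    """Reduce X (m x n) to k dimensions. Returns Z (m x k), U_reduce, mu, s."""
    # 1. Feature scaling and mean normalization.
    mu = X.mean(axis=0)
    s = X.std(axis=0)
    X_norm = (X - mu) / s
    # 2. Covariance matrix: Sigma = (1/m) * sum_i x^(i) (x^(i))^T
    m = X.shape[0]
    Sigma = (X_norm.T @ X_norm) / m
    # 3. Eigenvectors via SVD; the columns of U are the principal components.
    U, S, Vt = np.linalg.svd(Sigma)
    U_reduce = U[:, :k]            # keep the first k columns
    # 4. Project each example: z = U_reduce' * x
    Z = X_norm @ U_reduce
    return Z, U_reduce, mu, s
```
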

## Applying PCA

### Reconstruction from Compressed Representation

After compression, how do we go back to the higher dimensional state?
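In the course, the answer is the approximate inverse of the projection: given $z = U_{\text{reduce}}^T\, x$, the reconstruction is $x_{\text{approx}} = U_{\text{reduce}}\, z \approx x$. An illustrative NumPy sketch (function names are my own):

```python
import numpy as np

def project(X, U_reduce):
    """Project the n-dimensional rows of X onto the top-k principal components."""
    return X @ U_reduce            # z = U_reduce' * x, applied row-wise

def reconstruct(Z, U_reduce):
    """Map k-dimensional z back into the original n-dimensional space."""
    return Z @ U_reduce.T          # x_approx = U_reduce * z
```

For points that lie exactly on the lower-dimensional surface, the reconstruction is exact; otherwise $x_{\text{approx}}$ is the projection of $x$ onto that surface.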

### Choosing the Number of Principal Components

In the PCA algorithm, how do we choose the value for $k$? How do we choose the number of principal components?

Recall that PCA tries to minimize the average squared projection error

$$\dfrac{1}{m} \sum_{i=1}^{m} || x^{(i)} - x_{\text{approx}}^{(i)} ||^2$$

Also, the total variation in the data is defined as

$$\dfrac{1}{m} \sum_{i=1}^{m} || x^{(i)} || ^2$$

Typically choose $k$ to be the smallest value such that

$$\dfrac{ \dfrac{1}{m} \sum_{i=1}^{m} || x^{(i)} - x_{\text{approx}}^{(i)} || ^ 2 }{ \dfrac{1}{m} \sum_{i=1}^{m} || x^{(i)} || ^ 2 } \leq 0.01$$

This means that 99% of the variance is retained.

Vary the percentage of variance retained between 95 and 99 percent, depending on your application and requirements.

The retained variance can also be computed directly from the diagonal matrix $S$ returned by svd(Sigma), instead of recomputing the projections for each candidate $k$.
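Concretely, using the diagonal entries $S_{ii}$, choose the smallest $k$ such that $\dfrac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \geq 0.99$. An illustrative NumPy sketch (assumes $X$ is already mean-normalized; the function name is my own):

```python
import numpy as np

def smallest_k(X, retain=0.99):
    """Smallest k such that the chosen fraction of variance is retained,
    using the diagonal S from the SVD of the covariance matrix."""
    m = X.shape[0]
    Sigma = (X.T @ X) / m                # covariance matrix (X mean-normalized)
    U, S, Vt = np.linalg.svd(Sigma)      # S holds the diagonal entries S_ii
    retained = np.cumsum(S) / np.sum(S)  # sum_{i<=k} S_ii / sum_{i<=n} S_ii
    return int(np.searchsorted(retained, retain) + 1)
```
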

Do not run PCA unnecessarily. It is valid to run your learning algorithm on the raw data $x^{(i)}$ first, and implement PCA only if that fails (for example, because training is too slow or memory-intensive).