Taking the Coursera Machine Learning course. Will post condensed notes every week as part of the review process. All material originates from the free Coursera course, taught by Andrew Ng.

Assumes you have knowledge of Week 8.

• Lecture notes:

Anomoly Detection

Density Estimation

Problem Motivation

Imagine being a manufacturor of aircraft engines. Measure heat generated, vibration intensity, etc. You know have a dataset of these features which you can plot. Anomoly detection would be determining if a new aircraft engine $x_{\text{test}}$ is anomolous in relation to the previously measured engine features.

Given a dataset $x^{(1)}, x^{(2)}, \dots, x^{(m)}$, is $x_{\text{test}}$ anomalous? Check if $p(x) \gt \epsilon$ for given test set.

Some useful cases include fraud detection, manufacturing, monitoring computers in a data center, etc.

Suppose the anomaly detection system flags $x$ as anomalous whenever $p(x) \leq \epsilon$. It is flagging too many things as anomalous that are not actually so. The corrective step in this case would be to decrease the value of $\epsilon$.

Gaussian Distribution

The Gaussian Distribution (Normal Distribution), where $x \in \mathbb{R}$, has mean $\mu$ and variance $\sigma^2$:

$$x \thicksim \mathcal{N}(\mu, \sigma^2)$$ (The tilde means “distributed as”, Script ‘N’ stands for normal distribution

$$p(x;\mu, \sigma^2) = \dfrac{1}{\sqrt{2\pi}\sigma} \times \exp{(-\dfrac{(x-\mu)^2}{2\sigma^2})}$$

Red shaded area is equal to 1.

To calculate the average $\mu$ and variance $\sigma^2$ we use the following formulae:

$$\mu = \dfrac{1}{m} \sum\limits_{i=1}^{m}x^{(i)}$$ $$\sigma^2 = \dfrac{1}{m} \sum\limits_{i=1}^{m}(x^{(i)} - \mu)^2$$

Algorithm

Algorithm for density estimation. Given a training set {$x^{(1)}, \dots, x^{(m)}$} where each example is $x \in \mathbb{R}^n$

$$p(x) = p(x_1;\mu_1,\sigma^2_1)p(x_2;\mu_2,\sigma^2_2)p(x_3;\mu_3,\sigma^2_3)\dots p(x_n;\mu_n,\sigma^2_n)$$

where $x_1 \thicksim \mathcal{N}(\mu_1, \sigma^2_1)$,$x_2 \thicksim \mathcal{N}(\mu_2, \sigma^2_2)$, and so on.

More conscicely, this algorithm can be written as:

$$p(x) = \prod\limits_{j=1}^n p(x_j;\mu_j, \sigma^2_j)$$

The capital $\Pi$ is the product symbol, it is similar to the $\sum$ function except rather than adding, it performs multiplication.

1. Choose features $x_i$ that you think might be indicative of anomalous examples.

2. Fit parameters $\mu_1, \mu_2, \dots, \mu_n; \sigma_1^2, \sigma_2^2, \dots, \sigma_n^2$

3. Given new example $x$, compute $p(x)$. The example is an anomaly if $p(x) < \epsilon$.

Building an Anomaly Detection System

Developing and Evaluating an Anomaly Detection System

The importance of real-number evaluation, when developing a learning algorithm, learning decisions is much easier if we have a way of evaluating our learning algorithm. Assume we have some labeled data of anomalous and non-anomalous examples. ($y = 0$ if normal and $y = 1$ as anomalous).

Training set: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$ (assume that the training set is normal and not anomalous)

Cross validation set: $(x_{\text{cv}}^{(1)}, y_{\text{cv}}^{(1)}), \dots, (x_{\text{cv}}^{(m_{\text{cv}})}, y_{\text{cv}}^{(m_{\text{cv}})})$

Test set: $(x_{\text{test}}^{(1)}, y_{\text{test}}^{(1)}), \dots, (x_{\text{test}}^{(m_{\text{test}})}, y_{\text{test}}^{(m_{\text{test}})})$

The following would be a reccomended split of training sets and cross validation sets for an aircraft engine monitoring example:

One can evaluate the algoirthm by using precision and recall, or the $F_1$ score.

Anomaly Detection vs. Supervised Learning

Anomaly detection can be used if there are a very small number of positive examples ($y=1$, a range between zero and twenty is common). Anomaly detection should also have a large number of negative ($y=0$) examples. Anomalies should have many types, it’s hard for any algorithm to learn what anomalies look like, as future anomalies may look nothing like what we have seen so far.

• Fraud Detection
• Manufacturing (eg aircraft engines)
• Monitoring machines in a data center

Supervised learning should be used when there are a large number of positive and negative ezamples. There are enough positive examples for the algorithm to get a sense of what positive examples are like. Future positive examples are likely to be similar to the ones in the training set.

• Email/Spam classification
• Weather prediction (sunny/rainy/etc)
• Cancer classification

Choosing What Features to Use

One thing that we have done was plotting the features to see if the features fall into a normal (gaussian) distribution. Given a dataset that does not look gaussian, it might be useful to transform the features to look more gaussian. There are multiple different functions one can play with to make the data look more gaussian.

Error analysis for anomaly detection- we want $p(x)$ to be large for normal examples and $p(x)$ to be small for anomalous examples.

How do we fix the common problem where $p(x)$ is comparable for both normal and anomalous examples? This is still a manual process, look at the anomalous examples and distinguish features that make the irregular example anomalous.

Choose features that might take on unusually large or small values in the event of an anomaly.

Multivariate Gaussian Distribution

Algorithm

Multivariate Gaussian Distribution can be useful if there are correlation between the features that need to be accounted for when determining anomalies.

It may be useful to not model $p(x_1), p(x_2)$ separately. Model $p(x)$ all in one go. Parameters are $\mu \in \mathbb{R}^n$, $\Sigma \in \mathbb{R}^{n\times n}$.

Here are some examples of multivariate gaussian examples with varying sigma:

To perform parameter fitting, plug in the follwing formula:

Reccomender Systems

Predicting Movie Ratings

Problem Forumulation

Imagine you are a company that sells or rents out movies. You allow users to rate different movies from zero to five stars. You have a matrix of users, their sparsely populated rated movies, and you want to find out what they would also rate similarly according to existing data.

Content Based Recommendations

If the movies have features associated to the movie, they can be represented with a feature vector.

To learn $\theta^{(j)}$, refer to the following problem formulation:

$$\theta^{(j)} = \min\limits_{\theta^{(j)}} \dfrac{1}{2} \sum\limits_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \dfrac{\lambda}{2} \sum\limits_{k=1}^n (\theta_k^{(j)})^2$$

To learn $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(n_u)}$:

$$J(\theta^{(1)}, \dots, \theta^{(n_u)}) = \min\limits_{\theta^{(1)}, \dots, \theta^{(n_u)}} \dfrac{1}{2} \sum\limits_{j=1}^{n_u} \sum\limits_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \dfrac{\lambda}{2} \sum\limits_{j=1}^{n_u} \sum\limits_{k=1}^n (\theta_k^{(j)})^2$$

$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha \sum\limits_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)} \hspace{1em} \text{for } k=0$$ $$\theta_k^{(j)} := \theta_k^{(j)} - \alpha (\sum\limits_{i:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)} + \lambda\theta_k^{(j)} ) \hspace{1em} \text{for } k \neq 0$$

Collaborative Filtering

Collaborative Filtering Algorithm

This algorithm learns what features to use for an existing data set. It can be very difficult to determine how ‘romantic’ a movie is, or how much ‘action’ a movie has. Suppose we have a data set where we do not know the features of our movies.

The assumption here is that the users pre-specify what genres of movies they like.

Given $\theta^{(1)}, \dots, \theta^{(n_u)}$ to learn $x^{(i)}$:

$$x^{(i)} = \min\limits_{x^{(i)}} \dfrac{1}{2} \sum\limits_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \dfrac{\lambda}{2}\sum\limits_{k=1}^{n}(x_k^{(i)})^2$$

Given $\theta^{(1)}, \dots, \theta^{(n_u)}$ to learn $x^{(1)}, \dots, x^{(n_m)}$:

$$\min\limits_{x^{(i)}} \dfrac{1}{2} \sum\limits_{i=1}^{n_m} \sum\limits_{j:r(i,j)=1} ((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \dfrac{\lambda}{2} \sum\limits_{i=1}^{n_m} \sum\limits_{k=1}^{n}(x_k^{(i)})^2$$

1. Initialize $x^{(1)}, \dots, x^{(n_m)}$ and $\theta^{(1)}, \dots, \theta^{(n_u)}$ to random small values.

2. Minimize $J(x^{(1)}, \dots, x^{(n_m)}, \theta^{(1)}, \dots, \theta^{(n_u)})$ using gradient descent (or an advanced optimization algorithm).

3. For a user with parameters $\theta$ and a movie with (learned) features $x$ predict a star rating of $\theta^Tx$.

Low Rank Matrix Factorization

Vectorization: Low Rank Matrix Factorization

The vectorized implementation for the reccomender system can be visualized as the following:

Movies can be related if the feature vectors between the movies are small.

Implementational Detail: Mean Normalization

For users who have not rated any movies, the only term that effects the user is the regularization term.

This does not help us perform reccomendations as the value of all predicted stars will be 0 for the user who has not rated any movies. One way to address this is to apply mean normalization to the input data and pretend the normalized data is the new data to perform prediction with.

This allows us to perform reccomendations even though a user has not rated any movies, because we have the average rating of a movie based on all users.

Move on to Week 10.