Taking the Coursera Machine Learning course. Will post condensed notes every week as part of the review process. All material originates from the free Coursera course, taught by Andrew Ng.

Assumes you have knowledge of Week 6.


Support Vector Machines

Large Margin Classification

Optimization Objective

We simplify the logistic regression cost function by replacing its two curved cost terms (viewed as functions of $z = \theta^Tx$) with piecewise-linear approximations, $\text{cost}_1(z)$ for $y=1$ and $\text{cost}_0(z)$ for $y=0$, each made of two straight-line segments, as shown here:

svm_cost1_cost0

The following are two equivalent forms of the support vector machine optimization objective:

$$ \min\limits_{\theta} \dfrac{1}{m} [\sum\limits_{i=1}^m y^{(i)} \text{cost}_1(\theta^Tx^{(i)}) + (1-y^{(i)}) \text{cost}_0(\theta^Tx^{(i)}) ] + \dfrac{\lambda}{2m} \sum\limits_{j=1}^n \theta_j^2 $$ $$ \min\limits_{\theta} C[ \sum\limits_{i=1}^m y^{(i)} \text{cost}_1(\theta^Tx^{(i)}) + (1-y^{(i)}) \text{cost}_0(\theta^Tx^{(i)}) ] + \dfrac{1}{2} \sum\limits_{j=1}^{n} \theta^2_j $$

They both give the same value of $\theta$ if $C = \dfrac{1}{\lambda} $.

The hypothesis predicts:

$$ h_\theta(x) = 1 \hspace{1em} \text{if} \hspace{1em} \theta^Tx \geq 0 $$ $$ h_\theta(x) = 0 \hspace{1em} \text{otherwise} $$
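To make the second form concrete, here is a minimal NumPy sketch (my own illustration, not course code). It assumes $X$ already includes a bias column of ones as its first column, labels $y \in \{0, 1\}$, and that $\theta_0$ is not regularized:

```python
import numpy as np

def cost1(z):
    # piecewise-linear cost used when y = 1: zero once z >= 1
    return np.maximum(0, 1 - z)

def cost0(z):
    # piecewise-linear cost used when y = 0: zero once z <= -1
    return np.maximum(0, 1 + z)

def svm_objective(theta, X, y, C):
    # X: (m, n+1) with a leading bias column, y: (m,) in {0, 1}, C ~ 1/lambda
    z = X @ theta
    hinge = y * cost1(z) + (1 - y) * cost0(z)
    reg = 0.5 * np.sum(theta[1:] ** 2)   # skip theta_0, matching the j = 1..n sum
    return C * np.sum(hinge) + reg

def predict(theta, X):
    # hypothesis: 1 if theta^T x >= 0, else 0
    return (X @ theta >= 0).astype(int)
```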

Large Margin Intuition

Support Vector Machines are also known as Large Margin Classifiers. This is because, when the positive and negative examples are plotted, a support vector machine draws a decision boundary that keeps as large a margin as possible from both classes:

large_margin_classifier

This is different from logistic regression, where the decision boundary can end up very close to the positive and negative examples, since logistic regression only needs $\theta^Tx$ to be slightly above or below $0$ in the $y=1$ and $y=0$ cases.

svm_vs_linear_regression

When the data is not linearly separable (for example, because of outliers), the regularization parameter $C$ matters: a very large $C$ makes the decision boundary sensitive to outliers, while a smaller $C$ lets the SVM effectively ignore them.

svm_outliers

Kernels

For a non-linear decision boundary, one option is to use many high-order polynomial features, but there are many possible choices. A kernel instead computes new features for a given $x$ based on its proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}$.

Given the example $x$: $$ f_1 = \text{similarity}(x, l^{(1)}) = \exp\left(-\dfrac{\| x - l^{(1)} \|^2}{2 \sigma^2}\right) $$ $$ f_2 = \text{similarity}(x, l^{(2)}) = \exp\left(-\dfrac{\| x - l^{(2)} \|^2}{2 \sigma^2}\right) $$ where $\| x - l^{(1)} \|^2$ is the squared Euclidean distance between $x$ and $l^{(1)}$.

These functions are kernels; these specific ones are Gaussian kernels. Think of them as similarity functions.

If $x \approx l^{(1)}$ then $ f_1 \approx \exp(-\dfrac{0^2}{2\sigma^2}) \approx 1 $.

If $x$ is far from $l^{(1)}$ then $ f_1 = \exp(-\dfrac{\text{large number}^2}{2\sigma^2}) \approx 0 $

The smaller $\sigma$ is, the more rapidly the feature falls off as $x$ moves away from the landmark.

kernel_sigma
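Here is a small NumPy sketch of the Gaussian similarity above; the example point and landmarks are made up purely for illustration:

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    # f = exp(-||x - l||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

x  = np.array([1.0, 2.0])
l1 = np.array([1.0, 2.1])    # close to x  -> similarity near 1
l2 = np.array([8.0, -5.0])   # far from x  -> similarity near 0

print(gaussian_kernel(x, l1, sigma=1.0))   # ~0.995
print(gaussian_kernel(x, l2, sigma=1.0))   # ~5e-22
# Smaller sigma makes the similarity fall off faster with distance:
print(gaussian_kernel(x, l1, sigma=0.1))   # ~0.61
```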

How do we choose the landmarks $l$?

Given $m$ training examples, set $l$ to be each one of your training examples.

kernel_landmarks

The following is how you would train using kernels (similarity functions):

kernel_training
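As a sketch of that setup (the helper names are mine, not from the course), mapping an example $x$ to its kernel feature vector $f$ with every training example used as a landmark looks like this:

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

def to_kernel_features(x, landmarks, sigma):
    # f_i = similarity(x, l^(i)), with l^(i) = x^(i) for i = 1..m
    f = np.array([gaussian_kernel(x, l, sigma) for l in landmarks])
    return np.concatenate(([1.0], f))   # prepend f_0 = 1

# Training then minimizes the same objective as before, but with theta^T f^(i)
# in place of theta^T x^(i), so theta now has m + 1 parameters.
```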

When using an SVM with a Gaussian kernel, you need to choose both $C$ and $\sigma^2$, and both affect the bias/variance tradeoff: a large $C$ (equivalent to a small $\lambda$) gives lower bias but higher variance, while a large $\sigma^2$ makes the features vary more smoothly, giving higher bias but lower variance. The following are the bias/variance tradeoff diagrams:

kernel_bias_variance_tradeoff
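In practice these values are usually picked by trying several candidates and comparing cross-validation error. A hedged sketch using scikit-learn (the course itself uses Octave, so this is my choice of tooling, not course code); note that scikit-learn's RBF kernel is parameterized by gamma, which plays the role of $1/(2\sigma^2)$:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# toy data with a non-linear boundary, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

sigmas = np.array([0.1, 0.3, 1.0, 3.0])
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],       # large C: lower bias, higher variance
    "gamma": 1.0 / (2 * sigmas ** 2),   # large sigma: higher bias, lower variance
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```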

Support Vector Machines (in Practice)

When should you choose a linear kernel versus a Gaussian kernel?

which_kernel_to_use

Note: When using the Gaussian kernel, it is important to perform feature scaling beforehand.
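A short sketch of that advice with scikit-learn (again my choice of tooling, not course code): scale the features first, then fit either a linear-kernel or a Gaussian (RBF) kernel SVM:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Gaussian (RBF) kernel: scaling matters because unscaled features with large
# ranges would dominate ||x - l||^2 and distort the similarity.
rbf_model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.5))

# Linear kernel ("no kernel"), typically for large n / small m settings.
linear_model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))

# Usage: rbf_model.fit(X_train, y_train); rbf_model.score(X_cv, y_cv)
```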

The kernels that you choose must satisfy a technical condition called “Mercer’s Theorem” to make sure SVM packages’ optimizations run correctly and do not diverge.

Logistic regression vs Support Vector Machines

Let $n$ be the number of features and $m$ the number of training examples. If $n$ is large relative to $m$, use logistic regression or an SVM with a linear kernel (no kernel). If $n$ is small and $m$ is intermediate, use an SVM with a Gaussian kernel. If $n$ is small and $m$ is very large, add more features, then use logistic regression or an SVM with a linear kernel.

Neural networks are likely to work well for most of these settings, but may be slower to train.


Move on to Week 8.