Taking the Coursera Machine Learning course. Will post condensed notes every week as part of the review process. All material originates from the free Coursera course, taught by Andrew Ng.

Assumes you have knowledge of Week 6.

Table of Contents

Support Vector Machines

Large Margin Classification

Optimization Objective

We simplify the logistic regression cost function by replacing its curved cost terms with two straight-line approximations, $\text{cost}_1$ and $\text{cost}_0$, as shown here:


The following are two cost functions for support vector machines:

$$ \min\limits_{\theta} \dfrac{1}{m} [\sum\limits_{i=1}^m y^{(i)} \text{cost}_1(\theta^Tx^{(i)}) + (1-y^{(i)}) \text{cost}_0(\theta^Tx^{(i)}) ] + \dfrac{\lambda}{2m} \sum\limits_{j=1}^n \theta_j^2 $$ $$ \min\limits_{\theta} C[ \sum\limits_{i=1}^m y^{(i)} \text{cost}_1(\theta^Tx^{(i)}) + (1-y^{(i)}) \text{cost}_0(\theta^Tx^{(i)}) ] + \dfrac{1}{2} \sum\limits_{j=1}^{n} \theta^2_j $$

They both give the same value of $\theta$ if $C = \dfrac{1}{\lambda} $.
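The two straight-line cost pieces are hinge-style functions: $\text{cost}_1(z)$ is zero once $z \geq 1$, and $\text{cost}_0(z)$ is zero once $z \leq -1$. A minimal sketch (the function names `cost1`/`cost0` mirror the notation above; NumPy assumed):

```python
import numpy as np

def cost1(z):
    """cost_1(z): used when y = 1. Zero for z >= 1, then grows linearly."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """cost_0(z): used when y = 0. Zero for z <= -1, then grows linearly."""
    return np.maximum(0.0, 1.0 + z)
```

Note the margin built into these costs: the SVM does not merely want $\theta^Tx \geq 0$ when $y=1$; the cost is only zero once $\theta^Tx \geq 1$.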

The hypothesis will predict:

$$ h_\theta(x) = 1 \hspace{1em} \text{if} \hspace{1em} \theta^Tx \geq 0 $$ $$ h_\theta(x) = 0 \hspace{1em} \text{otherwise} $$
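The prediction rule above is a simple threshold on $\theta^Tx$. A minimal sketch, assuming `X` already includes the intercept column of ones:

```python
import numpy as np

def svm_predict(theta, X):
    """Predict 1 where theta^T x >= 0, else 0 (X includes the intercept column)."""
    return (X @ theta >= 0).astype(int)
```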

Large Margin Intuition

Support Vector Machines are also known as Large Margin Classifiers. This is because when plotting the positive and negative examples, a support vector machine will draw a decision boundary with large margins:


This is different from logistic regression, where the decision boundary can be very close to the positive and negative examples (since $\theta^Tx$ can be close to $0$ in both the $y=1$ and $y=0$ cases).


When the data is not linearly separable, one should take the regularization parameter $C$ into consideration.



For a non-linear decision boundary, we could choose among many high-order polynomial features. A kernel instead computes new features from $x$ based on its proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}$.

Given the example $x$: $$ f_1 = \text{similarity}(x, l^{(1)}) = \exp\left(-\dfrac{|| x - l^{(1)} ||^2}{2 \sigma ^2}\right) $$ $$ f_2 = \text{similarity}(x, l^{(2)}) = \exp\left(-\dfrac{|| x - l^{(2)} ||^2}{2 \sigma ^2}\right) $$ $$ || x - l^{(1)} ||^2 = \text{the squared Euclidean distance between } x \text{ and } l^{(1)} $$

These functions are kernels; these specific ones are Gaussian kernels. Think of them as similarity functions.

If $x \approx l^{(1)}$ then $ f_1 \approx \exp(-\dfrac{0^2}{2\sigma^2}) \approx 1 $.

If $x$ is far from $l^{(1)}$ then $ f_1 \approx \exp(-\dfrac{(\text{large number})^2}{2\sigma^2}) \approx 0 $
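The similarity function above translates directly to code. A minimal sketch of the Gaussian kernel, assuming NumPy (the names `x`, `l`, `sigma` follow the notation in the formulas):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    """Similarity between example x and landmark l.
    Equals 1 when x == l and falls toward 0 as x moves away from l."""
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))
```

Lowering `sigma` makes the returned similarity drop off more sharply with distance.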

The smaller $\sigma$ is, the more rapidly the feature falls off as $x$ moves away from the landmark.


How do we choose the landmarks $l$?

Given $m$ training examples, set the landmarks to be exactly those training examples: $l^{(i)} = x^{(i)}$.
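With the landmarks set to the training examples themselves, each example maps to an $m$-dimensional feature vector of similarities. A minimal sketch (the function name `kernel_features` is illustrative; NumPy assumed):

```python
import numpy as np

def kernel_features(X, sigma):
    """Map each of the m examples to features f_j = similarity(x, l^(j)),
    where the landmarks l^(j) are the training examples themselves.
    Returns an m x m matrix F with F[i, j] = similarity(x^(i), x^(j))."""
    m = X.shape[0]
    F = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            diff = X[i] - X[j]
            F[i, j] = np.exp(-(diff @ diff) / (2 * sigma ** 2))
    return F
```

The diagonal of `F` is all ones, since every example has similarity 1 with itself.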


The following is how you would train using kernels (similarity functions):


When using an SVM, one of the choices that needs to be made is $C$; one must also choose $\sigma^2$. The following are the bias/variance tradeoff diagrams.


Support Vector Machines (in Practice)

When should you choose a linear kernel versus a Gaussian kernel?


Note: When using the Gaussian kernel, it is important to perform feature scaling beforehand.
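In practice one uses an SVM package rather than writing the optimizer by hand. A minimal sketch using scikit-learn (an assumption; the course itself uses Octave): `SVC`'s `rbf` kernel is the Gaussian kernel, with `gamma` playing the role of $\frac{1}{2\sigma^2}$, and a `StandardScaler` handles the feature scaling noted above.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

sigma = 0.5
clf = make_pipeline(
    StandardScaler(),                                   # feature scaling first
    SVC(kernel="rbf", C=1.0, gamma=1.0 / (2 * sigma ** 2)),
)

# XOR-style toy data: not linearly separable, so a Gaussian kernel is needed.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])
clf.fit(X, y)
```

`C` and `gamma` here are the same knobs as the bias/variance tradeoff above: larger `C` or larger `gamma` (smaller $\sigma^2$) moves toward lower bias and higher variance.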

The kernels that you choose must satisfy a technical condition called “Mercer’s Theorem” to make sure SVM packages’ optimizations run correctly and do not diverge.

Logistic Regression vs. Support Vector Machines

Neural networks are likely to work well for most of these settings, but may be slower to train.

Move on to Week 8.