Taking the Coursera Machine Learning course. Will post condensed notes every week as part of the review process. All material originates from the free Coursera course, taught by Andrew Ng.

Assumes you have knowledge of Week 9.

• Lecture notes:

# Large Scale Machine Learning

## Gradient Descent with Large Datasets

### Learning With Large Datasets

One of the best ways to get a high-performance machine learning system is to feed a lot of data to a low-bias learning algorithm (one flexible enough that it could overfit). Gradient descent can run very slowly when the training set is large, because each step requires a summation over all training examples.

$$h_\theta(x) = \sum\limits_{j=0}^n \theta_jx_j$$ $$J_\text{train}(\theta) = \dfrac{1}{2m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$$ $$\theta_j := \theta_j - \alpha \dfrac{1}{m}\sum\limits_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$$
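As a sketch, one batch gradient descent step for linear regression could look like the following in NumPy (the function name and the toy dataset are illustrative, not from the course):

```python
import numpy as np

def batch_gradient_step(theta, X, y, alpha):
    """One step of batch gradient descent for linear regression.

    X is an (m, n+1) design matrix whose first column is all ones,
    y is an (m,) target vector, theta an (n+1,) parameter vector.
    """
    m = X.shape[0]
    predictions = X @ theta                    # h_theta(x) for every example
    gradient = (X.T @ (predictions - y)) / m   # the sum over all m examples
    return theta - alpha * gradient

# Tiny illustration: fit y = 2x from two noiseless points.
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])
y = np.array([2.0, 4.0])
theta = np.zeros(2)
for _ in range(1000):
    theta = batch_gradient_step(theta, X, y, alpha=0.1)
```

Note that every call touches all `m` rows of `X`; with `m` in the hundreds of millions, that single matrix product is what makes each step expensive.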

• What if $m$ were 100,000,000? Computing $\dfrac{1}{m}\sum\limits_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$ becomes very expensive.
• A sanity check: could we get away with running gradient descent on just 1000 randomly selected examples?
• One way to decide is to plot learning curves for a range of values of $m$ (say 100; 10,000; 1,000,000) and verify that the algorithm has high variance (overfits) when $m$ is small; if so, more data should help.

Rather than running gradient descent on the entire training set, one can run gradient descent on one training example at a time. The following is the stochastic gradient descent algorithm:

$$\text{cost}(\theta, (x^{(i)}, y^{(i)})) = \dfrac{1}{2}(h_\theta(x^{(i)})-y^{(i)})^2$$ $$J_\text{train}(\theta) = \dfrac{1}{m}\sum\limits_{i=1}^m\text{cost}(\theta, (x^{(i)}, y^{(i)}))$$

1. Randomly shuffle or reorder the dataset.
2. Repeat { for $i = 1, \dots, m$ { $$\theta_j := \theta_j - \alpha(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$$ for $j = 0, \dots, n$ } } Rather than waiting to sum over all of the training examples before taking a step, we take a step after each single training example.

• Repeat the outer loop somewhere between 1 and 10 times; each pass of the inner loop iterates through all of the training examples.
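The two-step procedure above might be sketched in Python as follows (the function name and the toy dataset are illustrative; the course itself uses Octave):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.1, epochs=20, seed=0):
    """Sketch of stochastic gradient descent for linear regression.

    Shuffles the data each pass, then updates theta on one example
    at a time instead of summing over the whole training set.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):            # outer loop: typically 1-10 passes
        order = rng.permutation(m)     # step 1: randomly shuffle
        for i in order:                # step 2: one example per update
            error = X[i] @ theta - y[i]
            theta -= alpha * error * X[i]
    return theta

# Example: recover y = 2x from 50 noiseless points.
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 * x
theta = stochastic_gradient_descent(X, y)
```

Because this toy data is noiseless, the per-example gradients all vanish at the optimum and the parameters settle down; on noisy data the iterates keep wandering near the minimum instead.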

Mini-batch gradient descent is a variation of stochastic gradient descent: rather than using a single example in each iteration, it uses $b$ examples per iteration, where $b$ is the mini-batch size. Typical values of $b$ range from 2 to 100, with $b = 10$ a common choice.

b = 10 example:

$$(x^{(i)}, y^{(i)}), \dots, (x^{(i+9)}, y^{(i+9)})$$ $$\theta_j := \theta_j - \alpha\dfrac{1}{10} \sum\limits_{k=i}^{i+9}(h_\theta(x^{(k)})-y^{(k)})x_j^{(k)}$$ Increment $i$ by 10 and repeat until all training examples are used.

This algorithm becomes the same as normal batch gradient descent if $b = m$.

Stochastic gradient descent does not converge as smoothly as batch gradient descent. In batch gradient descent, the cost function decreases as the number of iterations increases; in stochastic gradient descent, this is not guaranteed.

$$\text{cost}(\theta, (x^{(i)}, y^{(i)})) = \dfrac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2$$ During learning, compute $\text{cost}(\theta, (x^{(i)}, y^{(i)}))$ before updating $\theta$ using $(x^{(i)}, y^{(i)})$. Every 1000 iterations (approximately, depending on your use case), plot $\text{cost}(\theta, (x^{(i)}, y^{(i)}))$ averaged over the last 1000 examples processed by the algorithm.
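This monitoring scheme might be sketched as follows, where each example's cost is recorded before the corresponding update and flushed into an average every `window` examples (`window` plays the role of the 1000 in the text; all names and the toy data are illustrative):

```python
import numpy as np

def sgd_with_cost_monitor(X, y, alpha=0.1, epochs=4, window=500, seed=0):
    """SGD that tracks convergence via windowed average cost.

    Returns the final theta and a list of average costs, one per
    window of `window` consecutive examples.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    recent, averages = [], []
    for _ in range(epochs):
        for i in rng.permutation(m):
            error = X[i] @ theta - y[i]
            recent.append(0.5 * error ** 2)   # cost BEFORE the update
            theta -= alpha * error * X[i]
            if len(recent) == window:         # flush one point to the plot
                averages.append(sum(recent) / window)
                recent = []
    return theta, averages

# Example: the averaged cost should trend downward on easy data.
x = np.linspace(0.0, 1.0, 500)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 * x
theta, averages = sgd_with_cost_monitor(X, y)
```

Plotting `averages` (e.g. with matplotlib) gives the learning curve described above; a curve that trends down but oscillates is normal for stochastic gradient descent.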

The learning rate should be sufficiently small. Additionally, when stochastic gradient descent nears a minimum, one way to make it converge is to slowly decrease the learning rate $\alpha$ over time.

One example of doing this is to make $\alpha = \dfrac{\text{const1}}{\text{iterationNumber} + \text{const2}}$. The constants are application dependent.
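Such a schedule might be sketched as a small helper (the constant values here are arbitrary placeholders, since they are application dependent):

```python
def decaying_alpha(iteration, const1=1.0, const2=100.0):
    """Learning-rate schedule alpha = const1 / (iterationNumber + const2).

    const1 and const2 are placeholder values; in practice they are
    tuned per application.
    """
    return const1 / (iteration + const2)

# The rate shrinks smoothly as iterations accumulate.
rates = [decaying_alpha(t) for t in (0, 100, 900)]
```

Inside an SGD loop, one would pass the running update count as `iteration`, so early updates take large steps and later updates take progressively smaller ones, damping the oscillation near the minimum.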