Taking the Coursera Deep Learning Specialization, Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization course. Will post condensed notes every week as part of the review process. All material originates from the free Coursera course, taught by Andrew Ng. See deeplearning.ai for more details.

Assumes you have knowledge of Improving Deep Neural Networks, Week 1.

Table of Contents

Optimization Algorithms

Mini-Batch Gradient Descent

Rather than training on your entire training set during each step of gradient descent, break out your examples into groups.

For instance, if you had 5,000,000 training examples, it might be useful to do 5000 batches of 1000 examples each.

New notation: each mini-batch of training examples is denoted with a curly-brace superscript:

$$ X^{\{t\}}, Y^{\{t\}} $$

mini_batch_gradient_descent

Using Mini-Batch Gradient Descent (5000 batches of 1000 examples each)

For each batch, perform forward prop on $ X^{\{t\}} $.

Then compute the cost:

$$ J^{\{t\}} = \dfrac{1}{1000} \sum\limits^{1000}_{i=1}\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2 \cdot 1000} \sum\limits_{l} ||W^{[l]}||^2_F $$

Then backprop to compute the gradients of $J^{\{t\}}$ (using $X^{\{t\}}, Y^{\{t\}}$) and take a gradient descent step.

One epoch means running through your entire training set.
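
Below is a minimal sketch of this loop, assuming `X` and `Y` are NumPy arrays with examples as columns, and assuming hypothetical helper functions `forward_prop`, `compute_cost`, `backward_prop`, and `update_parameters` for the network itself; only the mini-batching logic is shown.

```python
def mini_batch_gradient_descent(X, Y, parameters, batch_size=1000, num_epochs=10):
    """Gradient descent over mini-batches X^{t}, Y^{t} (examples stored as columns)."""
    m = X.shape[1]                               # total number of training examples
    for epoch in range(num_epochs):              # one epoch = one pass over the whole set
        for t in range(0, m, batch_size):
            X_t = X[:, t:t + batch_size]         # X^{t}: inputs for this mini-batch
            Y_t = Y[:, t:t + batch_size]         # Y^{t}: labels for this mini-batch
            cache = forward_prop(X_t, parameters)           # hypothetical forward pass
            cost = compute_cost(cache, Y_t, parameters)     # J^{t} for this mini-batch
            grads = backward_prop(cache, Y_t, parameters)   # hypothetical backprop
            parameters = update_parameters(parameters, grads)  # gradient descent step
    return parameters
```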

Understanding Mini-batch Gradient Descent

Training with mini-batch gradient descent does not show a smooth curve with a constantly decreasing cost. Because each mini-batch is different, the cost is a little jumpy, but it should still show a slightly noisier trend downward.

training_with_mini_batch_gradient_descent

If your mini-batch size is $m$, then you’re just doing batch gradient descent. This has the problem of taking too long per iteration.

If your mini-batch size is 1, then you’re doing stochastic gradient descent: every example is its own mini-batch. Stochastic gradient descent never converges exactly; it keeps oscillating around the minimum. It also loses all of the speedup from vectorization.

In practice, the mini-batch size used is somewhere in between 1 and $m$. This should give you the fastest learning: you get the speedup of vectorization while still making progress without waiting for the entire training set to be processed.

Takeaways

If you have a small training set (m < 2000), just use batch gradient descent.

Typical mini-batch sizes are powers of 2 between 64 and 1024 (e.g. 64, 128, 256, 512).

Make sure your mini-batch fits in CPU/GPU memory! ($ X^{\{t\}}, Y^{\{t\}} $)
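
A minimal sketch of building shuffled, power-of-2-sized mini-batches; `random_mini_batches` is a hypothetical helper name, and `X`, `Y` are again assumed to store examples as columns.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the training set, then slice it into mini-batches of size batch_size."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]                                   # number of examples
    permutation = rng.permutation(m)                 # shuffle so batches are not ordered
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]
    mini_batches = []
    for t in range(0, m, batch_size):                # last batch may be smaller than batch_size
        mini_batches.append((X_shuffled[:, t:t + batch_size],
                             Y_shuffled[:, t:t + batch_size]))
    return mini_batches
```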

Exponentially Weighted Averages

exponentially_weighted_averages

$$ V_t = \beta V_{t-1} + (1-\beta)\theta_t $$ $$ \beta = 0.9 \approx \text{ 10 days average temperature} $$

weighted_averages_high_beta

$$ \beta = 0.98 \approx \text{ 50 days average temperature} $$

weighted_averages_low_beta

$$ \beta = 0.5 \approx \text{ 2 days average temperature} $$

Understanding Exponentially Weighted Averages

Exponentially weighted averages are a way to average over all previous values, with weights that decay exponentially the further back a value is.

understanding_weighted_averages

$$ V_{100} = 0.1\theta_{100} + 0.1 \cdot 0.9 \cdot \theta_{99} + 0.1 \cdot 0.9^{2} \cdot \theta_{98} + 0.1 \cdot 0.9^{3} \cdot \theta_{97} + \dots $$
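
A minimal sketch of computing $V_t$ over a sequence of values $\theta_t$; the temperature numbers below are made up purely for illustration.

```python
def exponentially_weighted_average(thetas, beta=0.9):
    """Return each V_t = beta * V_{t-1} + (1 - beta) * theta_t, starting from V_0 = 0."""
    v, averages = 0.0, []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta    # older thetas get weighted by powers of beta
        averages.append(v)
    return averages

temperatures = [40, 49, 45, 44, 55, 52, 56]  # made-up daily temperatures (theta_1 ... theta_7)
print(exponentially_weighted_average(temperatures, beta=0.9))
```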

Bias Correction in Exponentially Weighted Averages

A slight bias occurs at the start because $V_0 = 0$, so the early averages come out too small. Therefore, you get a curve resembling the purple line.

bias_correction

Correct the bias by using $\dfrac{V_t}{1-\beta^{t}}$ instead of $V_t$. The correction matters most for small $t$ and fades away as $t$ grows.
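
For example, with $\beta = 0.98$ and $V_0 = 0$, the first uncorrected average is far too small, and dividing by $1 - \beta^{t}$ recovers a sensible value:

$$ V_1 = 0.98 \cdot 0 + 0.02\,\theta_1 = 0.02\,\theta_1 \qquad \Rightarrow \qquad \dfrac{V_1}{1-0.98^{1}} = \dfrac{0.02\,\theta_1}{0.02} = \theta_1 $$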

Gradient Descent with Momentum

Combine exponentially weighted averages of the gradients with gradient descent; the averaging smooths out oscillations so you can take larger steps toward the minimum.

gradient_descent_with_momentum

Implementation details

On iteration $t$:

Compute $dW, db$ on the current mini-batch

$$ v_{dW} = \beta v_{dW} + (1-\beta)dW $$ $$ v_{db} = \beta v_{db} + (1-\beta)db $$ $$ W = W - \alpha v_{dW}, b = b - \alpha v_{db} $$
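
A minimal sketch of this update for one layer's parameters, assuming `dW` and `db` come from backprop on the current mini-batch and that `v_dW`, `v_db` were initialized to zeros of the same shapes as `W`, `b`.

```python
def update_with_momentum(W, b, dW, db, v_dW, v_db, beta=0.9, alpha=0.01):
    """One gradient descent step using exponentially weighted averages of the gradients."""
    v_dW = beta * v_dW + (1 - beta) * dW   # moving average of dW
    v_db = beta * v_db + (1 - beta) * db   # moving average of db
    W = W - alpha * v_dW                   # step in the smoothed gradient direction
    b = b - alpha * v_db
    return W, b, v_dW, v_db
```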

RMSprop

Root Mean Square prop.

rmsprop
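
RMSprop keeps an exponentially weighted average of the squared gradients and divides each update by its square root, damping oscillations along directions with large gradients. The update on each iteration is:

$$ S_{dW} = \beta_2 S_{dW} + (1-\beta_2)dW^2, \qquad S_{db} = \beta_2 S_{db} + (1-\beta_2)db^2 $$

$$ W := W - \alpha \dfrac{dW}{\sqrt{S_{dW}}+\epsilon}, \qquad b := b - \alpha \dfrac{db}{\sqrt{S_{db}}+\epsilon} $$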

Adam Optimization Algorithm

Initialize $V_{dW}=0, S_{dW}=0, V_{db}=0, S_{db}=0$.

On iteration t:

Compute dW, db using mini-batch

Momentum $\beta_1$ $$ V_{dW} = \beta_{1}V_{dW} + (1-\beta_{1})dW $$ $$ \hspace{1em} V_{db} =\beta_{1}V_{db} + (1-\beta_1)db $$

$$ V^{\text{corrected}}_{dW} = V_{dW}/(1-\beta_{1}^t) $$ $$ \hspace{1em} V^{\text{corrected}}_{db} = V_{db}/(1-\beta_1^t) $$

RMSprop $\beta_2$ $$ S_{dW} = \beta_{2}S_{dW} + (1-\beta_{2})dW^2 $$ $$ \hspace{1em} S_{db} = \beta_{2}S_{db} + (1-\beta_2)db^2 $$

$$ S^{\text{corrected}}_{dW} = S_{dW}/(1-\beta_2^t) $$ $$ S^{\text{corrected}}_{db} = S_{db}/(1-\beta_2^t) $$

Finally

$$ W := W - \alpha \dfrac{V^{\text{corrected}}_{dW}}{\sqrt{S^{\text{corrected}}_{dW}}+\epsilon}$$

$$ b := b - \alpha \dfrac{V^{\text{corrected}}_{db}}{\sqrt{S^{\text{corrected}}_{db}}+\epsilon} $$

Hyperparameters Choice:

$$ \alpha : \text{ needs to be tuned} $$ $$ \beta_1: 0.9 \leftarrow (dW) $$ $$ \beta_2: 0.999 \leftarrow (dW^2) $$ $$ \epsilon: 10^{-8} $$

Adam: adaptive moment estimation.
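
A minimal sketch of one Adam step for a single layer, assuming `dW`, `db` come from backprop on the current mini-batch, `t` counts update steps starting at 1, and the dictionaries `v` and `s` were initialized with zeros.

```python
import numpy as np

def adam_update(W, b, dW, db, v, s, t,
                alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step: momentum (v) plus RMSprop (s), both with bias correction."""
    v["dW"] = beta1 * v["dW"] + (1 - beta1) * dW          # momentum moving averages
    v["db"] = beta1 * v["db"] + (1 - beta1) * db
    s["dW"] = beta2 * s["dW"] + (1 - beta2) * dW ** 2     # RMSprop moving averages of squares
    s["db"] = beta2 * s["db"] + (1 - beta2) * db ** 2

    v_dW_corr = v["dW"] / (1 - beta1 ** t)                # bias correction
    v_db_corr = v["db"] / (1 - beta1 ** t)
    s_dW_corr = s["dW"] / (1 - beta2 ** t)
    s_db_corr = s["db"] / (1 - beta2 ** t)

    W = W - alpha * v_dW_corr / (np.sqrt(s_dW_corr) + epsilon)
    b = b - alpha * v_db_corr / (np.sqrt(s_db_corr) + epsilon)
    return W, b, v, s
```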

Learning Rate Decay

One thing that might help speed up the learning algorithm is to slowly reduce the learning rate $\alpha$ over time.

Near convergence, a smaller learning rate lets the algorithm take smaller steps and settle in a tighter region around the minimum instead of oscillating widely.

Recall 1 epoch = 1 pass through your entire training set.

$$ \alpha = \dfrac{1}{1 + \text{decay-rate} * \text{epoch-num}} * \alpha_0 $$

Example $$ \alpha_0 = 0.2 $$ $$ \text{decay-rate} = 1 $$

Epoch $\alpha$
1 0.1
2 0.067
3 0.05
4 0.04
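
A minimal sketch of the decay formula, reproducing the table above with $\alpha_0 = 0.2$ and decay-rate 1.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)"""
    return alpha0 / (1 + decay_rate * epoch_num)

for epoch in range(1, 5):
    print(epoch, round(decayed_learning_rate(alpha0=0.2, decay_rate=1, epoch_num=epoch), 3))
# 1 0.1
# 2 0.067
# 3 0.05
# 4 0.04
```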

There are many other forms of learning rate decay, such as exponential decay, a discrete staircase, or manual decay.

The Problem of Local Optima

Intuitions about local optima from low-dimensional spaces do not transfer to high-dimensional spaces. In high dimensions, most points with zero gradient are saddle points rather than local optima; the real problem is plateaus, where the gradient stays close to zero for a long stretch and learning slows down.

local_optimum

You are pretty unlikely to get stuck in a bad local optimum, but plateaus can make learning slow.

plateau