Taking the Coursera Deep Learning Specialization, **Neural Networks and Deep Learning** course. Will post condensed notes every week as part of the review process. All material originates from the free Coursera course, taught by Andrew Ng. See deeplearning.ai for more details.

Assumes you have knowledge of Week 3.

# Table of Contents

# Deep Neural Networks

## Deep Neural Network

### Deep L-layer neural network

Capital $L$ denotes the number of layers in the network, e.g. $L = 4$ for a network with three hidden layers and one output layer.

We use $n^{[l]}$ to denote number of units in layer $l$.

$$ n^{[0]} = n_x = 3, \quad n^{[1]} = 5, \quad n^{[2]} = 5, \quad n^{[3]} = 3, \quad n^{[4]} = 1 $$

### Forward Propagation in a Deep Network

$$ z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]} $$ $$ a^{[l]} = g^{[l]}(z^{[l]}) $$

Vectorized:

$$ X = A^{[0]} $$ $$ Z^{[1]} = W^{[1]} X + b^{[1]} $$ $$ A^{[1]} = g^{[1]}(Z^{[1]}) $$
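The vectorized forward pass can be sketched as a short numpy loop. This is a minimal sketch, assuming ReLU hidden layers and a sigmoid output; the function and dictionary names (`forward`, `params`, `"W1"`, `"b1"`, ...) are my own convention, not from the course.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, params, L):
    """Vectorized forward pass over L layers (assumed ReLU hidden, sigmoid output)."""
    A = X  # A^[0] = X
    caches = []
    for l in range(1, L + 1):
        W, b = params["W" + str(l)], params["b" + str(l)]
        Z = W @ A + b                          # Z^[l] = W^[l] A^[l-1] + b^[l]
        caches.append((A, Z))                  # cache Z (and A^[l-1]) for backprop
        A = sigmoid(Z) if l == L else relu(Z)  # A^[l] = g^[l](Z^[l])
    return A, caches
```

Note the loop is unavoidable here: layers depend on each other sequentially, so only the computation *within* each layer is vectorized across the $m$ examples.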

### Getting your matrix dimensions right

$$ W^{[1]} : (n^{[1]}, n^{[0]}) $$

$$ W^{[l]} : (n^{[l]}, n^{[l-1]}) $$

The shape of $b^{[l]}$ is $(n^{[l]}, 1)$; it is broadcast across the $m$ columns of $Z^{[l]}$, and $Z^{[l]}$, $A^{[l]}$ have shape $(n^{[l]}, m)$.
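One way to make these dimension rules concrete is to initialize the parameters and assert the shapes. A minimal sketch, reusing the layer sizes from the example above (`init_params` is a hypothetical helper name):

```python
import numpy as np

# Layer sizes from the example: n^[0] = n_x = 3, then 5, 5, 3, 1 (L = 4).
layer_dims = [3, 5, 5, 3, 1]

def init_params(layer_dims):
    """W^[l] has shape (n^[l], n^[l-1]); b^[l] has shape (n^[l], 1)."""
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

params = init_params(layer_dims)
assert params["W1"].shape == (5, 3)  # (n^[1], n^[0])
assert params["b1"].shape == (5, 1)  # (n^[1], 1)
```

Checking shapes like this is one of the most effective ways to debug an implementation.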

### Why deep representations?

Deep networks compose functions of increasing complexity. Consider a face classifier: early layers detect edges, middle layers group edges into parts such as eyes and noses, and later layers group those parts into whole faces.

Circuit theory and deep learning:

Informally: There are functions you can compute with a “small” L-layer deep neural network that shallower networks require exponentially more hidden units to compute.

### Building blocks of deep neural networks

$Z^{[l]}$ is cached during forward propagation and reused during back propagation, so the gradients can be computed without redoing the forward pass.

### Forward and Backward Propagation

Forward propagation

- input $a^{[l-1]}$
- output $a^{[l]}$, cache $(z^{[l]})$

$$ z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]} $$ $$ a^{[l]} = g^{[l]}(z^{[l]}) $$

Vectorized

$$ Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]} $$ $$ A^{[l]} = g^{[l]}(Z^{[l]}) $$

Back propagation

- input $da^{[l]}$
- output $da^{[l-1]}, dW^{[l]}, db^{[l]}$

$$ dz^{[l]} = da^{[l]} \ast g^{[l]\prime}(z^{[l]}) $$ $$ dW^{[l]} = dz^{[l]} a^{[l-1]T} $$ $$ db^{[l]} = dz^{[l]} $$ $$ da^{[l-1]} = W^{[l]T} dz^{[l]} $$

With a sigmoid output and cross-entropy loss, back propagation is initialized at the output layer with

$$ da^{[L]} = -\dfrac{y}{a^{[L]}} + \dfrac{1-y}{1-a^{[L]}} $$
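The linear part of the backward step maps directly to numpy. A minimal sketch (the function name `linear_backward` is my own); note that in the vectorized form over $m$ examples, $dW$ and $db$ average the per-example gradients, a $1/m$ factor the single-example formulas above do not show:

```python
import numpy as np

def linear_backward(dZ, A_prev, W):
    """Backprop through Z^[l] = W^[l] A^[l-1] + b^[l] for a batch of m examples."""
    m = A_prev.shape[1]
    dW = (dZ @ A_prev.T) / m                     # dW^[l] = (1/m) dZ^[l] A^[l-1]T
    db = np.sum(dZ, axis=1, keepdims=True) / m   # db^[l] = (1/m) sum of dZ^[l] columns
    dA_prev = W.T @ dZ                           # dA^[l-1] = W^[l]T dZ^[l]
    return dA_prev, dW, db
```

A quick sanity check: every gradient has the same shape as the quantity it differentiates, so `dW.shape == W.shape` and `dA_prev.shape == A_prev.shape` must hold.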

### Parameters vs Hyperparameters

Parameters: $ W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}, \dots $

Hyperparameters:

- learning rate $\alpha$
- number of iterations
- number of hidden layers L
- number of hidden units per layer
- choice of activation function per layer

Hyperparameters covered later in the specialization:

- momentum
- minibatch size
- regularization parameters

Applied deep learning is a very empirical process.

```
Idea -> Code -> Experiment
<- Repeat <-
```

### What does this have to do with the brain?

Less like the brain, more like a universal function approximator; the brain analogy is a loose one.