Taking the Coursera Deep Learning Specialization, **Neural Networks and Deep Learning** course. Will post condensed notes every week as part of the review process. All material originates from the free Coursera course, taught by Andrew Ng. See deeplearning.ai for more details.

Assumes you have knowledge of Week 1.

# Table of Contents

# Neural Networks Basics

## Logistic Regression as a Neural Network

### Binary Classification

Binary classification is basically answering a yes or no question. For example: Is this an image of a cat? (1: Yes, 0: No).

Let’s say you have an image of a cat that is 64 by 64 pixels. You have labeled training data indicating whether or not each image is a cat (`y=1`

) or not a cat (`y=0`

).

**Notation**

Let’s say each picture can be represented as a single vector of size $n_x$ combined by joining three vectors (64 * 64 red pixel values) + (64 * 64 green pixel values) + (64 * 64 blue pixel values).

$$ n_x = \text{ unrolled image vector size } = 12288 $$ $$ m \text{ training examples } = \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)}) \} $$ $$ x \in \mathbb{R}^{n_x}, y \in \{ 0, 1 \} $$ $$ X \in \mathbb{R}^{n_x \times m} $$ $$ Y \in \mathbb{R}^{1 \times m} $$ $$ X = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \newline x^{(1)} & x^{(2)} & \dots & x^{(m)} \newline \vdots & \vdots & \vdots & \vdots \end{bmatrix} $$ $$ Y = \begin{bmatrix} y^{(1)} & y^{(2)} & \dots y^{(m)} \end{bmatrix} $$

### Logistic Regression

Logistic regression is when you want to have an answer in a continuous output. For instance, with the image of a cat problem, rather than having whether or not the image is of a cat or not, one could ask “What is the probability that this is a cat?”

**Notation**

Given $x$, want $\hat{y} = P(y=1 | x)$, $$ x \in \mathbb{R}^{n_x} $$ $$ 0 \leq \hat{y} \leq 1 $$ $$ \text{Parameters: } w \in \mathbb{R}^{n_x}, b \in \mathbb{R} $$ $$ \text{Output: } \hat{y} = \sigma(w^Tx + b) $$ $$ z = w^Tx + b $$ $$ \sigma(z) = \dfrac{1}{1 + e^{-z}} $$

If $z$ is a large positive number then $\sigma(z) = \dfrac{1}{1 + 0} \approx 1 $.

If $z$ is a large negative number then $\sigma(z) = \dfrac{1}{1 + \inf} \approx 0 $.

### Logistic Regression Cost Function

A loss function is applied to a single training example. For logistic regression, typical loss function used is:

$$ \mathcal{L}(\hat{y}, y) = -(y\log{\hat{y}} + (1-y)\log{(1-\hat{y})}) $$

- If $y = 1$; $ \mathcal{L}(\hat{y}, y) = -\log{\hat{y}} $
- Want $\log{\hat{y}}$ to be large, we want $\hat{y}$ to be large.

- If $y = 0$; $ \mathcal{L}(\hat{y}, y) = -\log{(1-\hat{y})}$
- Want $\log{(1-\hat{y})}$ to be large, we want $\hat{y}$ to be small.

A cost function is applied to the entire training set, it evaluates the parameters of your algorithm. (Cost of your parameters).

$$ J(w, b) = \dfrac{1}{m} \sum\limits^{m}_{i=1} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) $$ $$ J(w, b) = -[\dfrac{1}{m} \sum\limits^{m}_{i=1} y^{(i)} \log{\hat{y}^{(i)}} + (1-y^{(i)})\log{(1-\hat{y}^{(i)})}] $$

### Gradient Descent

The cost function measures how well $w, b$ measure the training set. We want to find the $w, b$ that minimize $J(w, b)$.

Repeat { $$ w := w - \alpha \dfrac{\partial J(w, b)}{\partial w} $$ $$ b := b - \alpha \dfrac{\partial J(w, b)}{\partial b} $$ }

Typically, in code, the derivative term is written as `dw`

. Example: `w = w - alpha * dw`

.

### Derivatives

You don’t need a lot of calculus to understand neural networks. This is a basic example of the derivative of a straight line $f(a) = 3a$.

### More Derivatives Examples

This is another example of the derivative of $f(a) = a^2$.

Here are three examples: $f(a) = a^2$, $f(a) = a^3$, and $f(a) = \log{a}$.

Take home: - Derivative just means the slope of the line. - You want to find slope? Look at calculus textbook.

### Computation Graph

Computation graph is a left to right pass visualization of the math behind your algorithm.

### Derivatives with a Computation Graph

Recall calculus, chain rule.

)

### Logistic Regression Gradient Descent

Recall the follwing logistic regression formula defined above.

### Gradient Descent on m Examples

Recall the cost function:

$$ J(w, b) = \dfrac{1}{m} \sum\limits^m_{i=1} \mathcal{L}(a^{(i)},y^{(i)}) $$ $$ a^{(i)} = \hat{y}^{(i)} = \sigma(z^{(i)}) = \sigma(w^Tx^{(i)} + b) $$

This is the naive formula for a single step of logistic regression on $m$ examples with $n = 2$ (two features) using gradient descent.

`begin single step of gradient descent`

$ J = 0; dw_1 = 0; dw_2 = 0; db = 0 $
`// define accumulator values`

For $i = 1 \text{ to } m$ do {
$$ z^{(i)} = w^Tx^{(i)} + b $$
$$ a^{(i)} = \sigma(z^{(i)}) $$
$$ J = - [ y^{(i)} \log(a^{(i)}) + (1-y^{(i)})\log(1-a^{(i)}) ] $$
$$ dz^{(i)} = a^{(i)} - y^{(i)} $$
$$ dw_1 = dw_1 + x_1^{(i)} \times dz^{(i)} $$
$$ dw_2 = dw_2 + x_2^{(i)} \times dz^{(i)} $$
`// if n were greater than two, continue to do this for dw_3, etc`

$$ db = db + dz^{(i)} $$
}

$ J = \dfrac{J}{m} $; $ dw_1 = \dfrac{dw_1}{m} $; $ dw_2 = \dfrac{dw_2}{m} $; $ db = \dfrac{db}{m} $;

`end single step of gradient descent`

For each step of gradient descent, you need to do effectively two for loops:

- for your $m$ number of training examples
- for your $n$ number of example features.

This is why vectorization is important in deep learning.

## Python and Vectorization

### Vectorization

Vectorization is the art of getting rid of explicit for loops in code.

Example: $ z = w^Tx + b $ where $ w \in \mathbb{R}^{n_x} $ and $ x \in \mathbb{R}^{n_x} $

```
// non vectorized
z = 0
for i in range(n-x):
z += w[i] * x[i]
z += b
//vectorized
import numpy as np
z = np.dot(w,x) + b
```

The following is a vectorization demo.

```
import numpy as np
import time
a = np.random.rand(1000000)
b = np.random.rand(1000000)
tic = time.time()
c = np.dot(a, b)
toc = time.time()
print("vectorized version: " + str(1000 * (toc-tic)) + "ms")
# vectorized version: 14.4419670105ms
c = 0
tic = time.time()
for i in range(1000000):
c += a[i]*b[i]
toc = time.time()
print("non-vectorized version: " + str(1000 * (toc-tic)) + "ms")
# non-vectorized version: 428.48610878ms
```

Vectorization increases performance by allowing the program to take advantage of parallelization. Wherever possible, avoid for loops.

### More Vectorization Examples

Whenever possible, avoid explicit for-loops.

```
# matrix times a vector, vectorized
A = np.dot(A, v)
# apply exponential operation on every element of a matrix/vector
u = np.zeros((n, 1))
for i in range(n):
u[i] = math.exp(v[i])
# or vectorized
u = np.exp(v)
np.log(v) # element wise log
np.abs(v) # elementwise abs
np.maximum(v, 0) # ReLU
```

### Vectorizing Logistic Regression

We want to calculate:

for i in range of 1 to m { $$ z^{(i)} = w^Tx^{(i)} + b $$ $$ a^{(i)} = \sigma(z^{(i)}) $$ }

Recall that $X$ is in the shape of $(n_x, m)$, making it an $\mathbb{R}^{n_x \times m}$ sized matrix

$$ X = \begin{bmatrix} \vdots & \vdots & \vdots & \vdots \newline x^{(1)} & x^{(2)} & \dots & x^{(m)} \newline \vdots & \vdots & \vdots & \vdots \end{bmatrix} $$

$$ Z = w^TX+b $$

```
Z = np.dot(w.T, X) + b
# Z is a row vector of size m
A = sigmoid(Z)
```

### Vectorizing Logistic Regression’s Gradient Output

`db = 1 / m * (np.sum(dZ))`

### Broadcasting in Python

```
import numpy as np
A = np.array([[56.0, 0.0, 4.4, 68.0],
[1.2, 104.0, 52.0, 8.0],
[1.8, 135.0, 99.0, 0.9]])
cal = A.sum(axis=0)
print(cal)
# [59. 239. 155.4 76.9]
percentage = 100*A/cal.reshape(1,4)
print(percentage)
#[[ 94.91525424 0. 2.83140283 88.42652796]
# [ 2.03389831 43.51464435 33.46203346 10.40312094]
# [ 3.05084746 56.48535565 63.70656371 1.17035111]]
```

Python does some magic in broadcasting for matrix/array operations:

### Note on Python/NumPy Vectors

Broadcasting may introduce subtle bugs in code, as column/row mismatch no longer is thrown

```
import numpy as np
a = np.random.randn(5) # avoid rank 1 arrays, explicitly define your column vector (5, 1) or row vector (1, 5)
print(a)
# [ 1.2, 2.3, 3.4, 4.5, 5.6 ]
print(a.shape)
# (5,)
print(a.T)
# [ 1.2, 2.3, 3.4, 4.5, 5.6 ]
a = np.random.randn(5, 1)
print(a)
# [[1.2]
# [2.3]
# [3.4]
# [4.5]
# [5.6]]
```

Occastionally assert your shape when you’re not sure `assert(a.shape == (5, 1))`

.

Move on to Week 3.