# Course 7 - Vanishing Gradients, Fancy RNNs

## Vanishing Gradients

Gradient can be viewed as a measure of *the effect of the past on the future*.

If the gradient over long distances becomes vanishingly small, we can't tell which of these is true:

- There is no dependency between step $t$ and $t+n$ in the data
- We have the wrong parameters to capture the true dependency between $t$ and $t+n$

## Effect of vanishing gradient on RNN-LM

LM task - the model fails to learn long-distance dependencies like subject-verb agreement:

Syntactic recency: The *writer* of the books **is** (correct - the verb agrees with *writer*)

Sequential recency: The writer of the *books* **are** (incorrect - the verb agrees with the nearby *books*)

Due to vanishing gradients, RNN-LMs are better at learning from sequential recency than from syntactic recency, so they make this kind of error more often than we'd like.

## Why is exploding gradient a problem?

If the gradient becomes too big, the SGD update step becomes too big: we can take too large a step and reach a bad parameter configuration (with large loss). In the worst case, updates produce Inf or NaN in the network.

### Solution: gradient clipping

If the norm of the gradient is greater than some threshold, scale it down before applying the SGD update. Intuition: take a step in the same direction, but a smaller step.

Algorithm 1: Pseudo-code for norm clipping

$\hat{g} \leftarrow \frac{\partial \mathcal{E}}{\partial \theta}$

**if** $\|\hat{g}\| \geq threshold$ **then**

$\quad \hat{g} \leftarrow \frac{threshold}{\|\hat{g}\|} \hat{g}$

**end if**

(where $\mathcal{E}$ is the training error/loss)
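The pseudo-code above can be sketched in NumPy (a minimal illustration; `threshold` is a hyperparameter and the default value here is just an example):

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Norm clipping: if the gradient's L2 norm exceeds the threshold,
    rescale it to have norm exactly `threshold` (same direction, smaller step)."""
    norm = np.linalg.norm(grad)
    if norm >= threshold:
        grad = (threshold / norm) * grad
    return grad
```

In practice, deep learning frameworks provide this out of the box (e.g. `torch.nn.utils.clip_grad_norm_` in PyTorch).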

## Long Short-Term Memory (LSTM)

On step $t$, there is a hidden state $h^{(t)}$ and a cell state $c^{(t)}$

- Both are vectors of length $n$
- The cell stores long-term information
- The LSTM can erase, write and read information from the cell

The selection of which information is erased / written / read is controlled by three corresponding **gates**

- The gates are also vectors of length $n$
- On each timestep, each element of a gate can be open (1), closed (0), or anywhere in between
- The gates are dynamic: their values are computed based on the current context

We have a sequence of inputs $x^{(t)}$, and we compute a sequence of hidden states $h^{(t)}$ and cell states $c^{(t)}$. On timestep $t$:

Forget gate - controls what is kept vs. forgotten from the previous cell state

Input gate - controls what parts of the new cell content are written to the cell

Output gate - controls what parts of the cell are output to the hidden state

New cell content - the new content to be written to the cell

Cell state - erase (“forget”) some content from the last cell state, and write (“input”) some new cell content

Hidden state - read (“output”) some content from the cell
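Written out, the gates and states above are the standard LSTM equations ($\sigma$ is the elementwise sigmoid, $\odot$ the elementwise (Hadamard) product; all $W$, $U$, $b$ are learned parameters):

$f^{(t)} = \sigma\left(W_f h^{(t-1)} + U_f x^{(t)} + b_f\right)$ (forget gate)

$i^{(t)} = \sigma\left(W_i h^{(t-1)} + U_i x^{(t)} + b_i\right)$ (input gate)

$o^{(t)} = \sigma\left(W_o h^{(t-1)} + U_o x^{(t)} + b_o\right)$ (output gate)

$\tilde{c}^{(t)} = \tanh\left(W_c h^{(t-1)} + U_c x^{(t)} + b_c\right)$ (new cell content)

$c^{(t)} = f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tilde{c}^{(t)}$ (cell state)

$h^{(t)} = o^{(t)} \odot \tanh c^{(t)}$ (hidden state)

Because the gates are applied with elementwise products, each gate element independently controls one dimension of the cell.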

## Gated Recurrent Units (GRU)

A simpler alternative to the LSTM.

On each timestep $t$ we have input $x^{(t)}$ and hidden state $h^{(t)}$ (no cell state)

Update gate - controls what parts of hidden state are updated vs preserved

Reset gate - controls what parts of previous hidden state are used to compute new content

New hidden state content - the reset gate selects the useful parts of the previous hidden state; these and the current input are used to compute the new hidden content

Hidden state - update gate simultaneously controls what is kept from previous hidden state, and what is updated to new hidden state content
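In equations (standard GRU formulation; same notation as for the LSTM):

$u^{(t)} = \sigma\left(W_u h^{(t-1)} + U_u x^{(t)} + b_u\right)$ (update gate)

$r^{(t)} = \sigma\left(W_r h^{(t-1)} + U_r x^{(t)} + b_r\right)$ (reset gate)

$\tilde{h}^{(t)} = \tanh\left(W_h (r^{(t)} \odot h^{(t-1)}) + U_h x^{(t)} + b_h\right)$ (new hidden content)

$h^{(t)} = (1 - u^{(t)}) \odot h^{(t-1)} + u^{(t)} \odot \tilde{h}^{(t)}$ (hidden state)

The last equation shows how a single gate $u^{(t)}$ does double duty: it weighs what is kept from $h^{(t-1)}$ against what is taken from $\tilde{h}^{(t)}$.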

## LSTM vs GRU

LSTM is a good default choice.

Rule of thumb: start with LSTM, but switch to GRU if you want something more efficient

## Vanishing/exploding gradient

Vanishing/exploding gradients are not only an RNN problem; a general solution for deep architectures is to add more direct connections, which allow the gradient to flow more easily.

e.g. residual connections (“ResNet”): skip connections where the identity connection preserves information by default

e.g. dense connections (“DenseNet”): each layer is directly connected to the layers after it

e.g. highway connections (“HighwayNet”): like residual connections, but the identity vs. transformation balance is controlled by a dynamic gate
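A minimal sketch of a residual connection (names and the one-layer sub-network `F` are illustrative, not from the original): the block's input is added to its output, so when `F` contributes nothing the block reduces to the identity and gradients have a direct route backwards.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W):
    """Residual (skip) connection: output = x + F(x), where
    F(x) = relu(W @ x) is a hypothetical one-layer sub-network.
    The identity path preserves information by default."""
    return x + relu(W @ x)
```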

## Bidirectional RNNs

Motivating task: sentiment classification. The contextual representation of a word can depend on both the left and the right context (e.g. “the movie was terribly exciting”).

Note: bidirectional RNNs are only applicable if you have access to the entire input sequence (so not for language modeling, where only left context is available).

BERT is built on bidirectionality.
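A bidirectional RNN runs one RNN left-to-right and another right-to-left, then concatenates the two hidden states at each timestep. A minimal sketch of that last step (function name is illustrative; both lists are assumed already aligned by timestep):

```python
import numpy as np

def concat_bidirectional(h_forward, h_backward):
    """Concatenate forward and backward hidden states per timestep.
    Assumes both lists are indexed by timestep t (i.e. the backward
    RNN's outputs were already re-ordered into left-to-right order)."""
    return [np.concatenate([hf, hb]) for hf, hb in zip(h_forward, h_backward)]
```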

## Multi-layer RNNs = stacked RNNs

High-performing RNNs are often multi-layer (but aren’t as deep as convolutional or feed-forward networks)
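A minimal sketch of stacking (a vanilla tanh-RNN layer; parameter names are illustrative): the hidden-state sequence of layer $i$ becomes the input sequence of layer $i+1$.

```python
import numpy as np

def rnn_layer(x_seq, Wx, Wh, b):
    """One vanilla tanh-RNN layer; returns its hidden-state sequence."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x in x_seq:
        h = np.tanh(Wx @ x + Wh @ h + b)
        states.append(h)
    return states

def stacked_rnn(x_seq, layers):
    """Multi-layer (stacked) RNN: each layer's hidden states are
    the next layer's inputs. `layers` is a list of (Wx, Wh, b)."""
    seq = x_seq
    for Wx, Wh, b in layers:
        seq = rnn_layer(seq, Wx, Wh, b)
    return seq
```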