Stanford CS224n Natural Language Processing Course 7

Course 7 - Vanishing Gradients, Fancy RNNs

Vanishing Gradients

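The gradient can be viewed as a measure of the effect of the past on the future. If the gradient becomes vanishingly small over longer distances (from step $t$ to step $t+n$), we cannot tell which of two cases holds:

  • There is no dependency between step $t$ and $t+n$ in the data
  • We have the wrong parameters to capture the true dependency between $t$ and $t+n$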
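
To make the shrinking (or growing) effect concrete, here is a minimal NumPy sketch under a simplifying assumption that is not in the notes: a linear RNN $h^{(t)} = W h^{(t-1)}$ with no nonlinearity and no input, so that the Jacobian of $h^{(t+k)}$ with respect to $h^{(t)}$ is just $W^k$. The function name `jacobian_norms` and the choice of $W$ as a scaled orthogonal matrix are illustrative only.

```python
import numpy as np

def jacobian_norms(scale, n_steps=20, dim=4, seed=0):
    """Spectral norm of W^k for k = 1..n_steps, with W = scale * Q, Q orthogonal.

    Assumes a linear RNN h_t = W h_{t-1}, whose Jacobian dh_{t+k}/dh_t is W^k.
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))  # random orthogonal matrix
    w = scale * q                                          # all singular values equal `scale`
    prod = np.eye(dim)
    norms = []
    for _ in range(n_steps):
        prod = w @ prod                                    # prod is now W^k
        norms.append(np.linalg.norm(prod, 2))              # largest singular value
    return np.array(norms)

shrinking = jacobian_norms(0.5)  # norm decays like 0.5**k (vanishing)
growing = jacobian_norms(1.5)    # norm grows like 1.5**k (exploding)
```

With a scale below 1 the influence of step $t$ on step $t+k$ decays exponentially in $k$; with a scale above 1 it blows up, which is the exploding case.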

Effect of vanishing gradient on RNN-LM

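LM task - because of vanishing gradients, the model struggles to learn long-distance dependencies, so it is unable to predict similar long-distance dependencies at test time

Syntactic recency: The writer of the books is (correct)

Sequential recency: The writer of the books are (incorrect)

Due to vanishing gradient, RNN-LMs are better at learning from sequential recency than from syntactic recency, so they make this type of agreement error more often than we would like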

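Why is exploding gradient a problem?

If the gradient becomes too large, the SGD update step becomes too large as well. This can cause bad updates: we take too large a step and reach a bad parameter configuration (with large loss). In the worst case, this results in Inf or NaN in the network.

Solution: gradient clipping. If the norm of the gradient is greater than some threshold, scale it down before applying the SGD update.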

Algorithm 1: Pseudo-code for norm clipping

$\hat{g} \leftarrow \frac{\partial \epsilon}{\partial \theta}$

if $\|\hat{g}\| \geq \text{threshold}$ then

  $\hat{g} \leftarrow \frac{\text{threshold}}{\|\hat{g}\|} \hat{g}$

end if
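
The pseudo-code above can be sketched in NumPy as follows. This is a minimal illustration on a single gradient vector; the function name `clip_by_norm` is hypothetical, and a real training loop would typically compute the norm over all parameters together.

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad so its L2 norm is at most threshold; the direction is unchanged."""
    norm = np.linalg.norm(grad)
    if norm >= threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])        # L2 norm is 5
clipped = clip_by_norm(g, 1.0)  # rescaled to norm 1, same direction
small = clip_by_norm(g, 10.0)   # below the threshold, returned unchanged
```

Clipping takes a step in the same direction as the original gradient, only a smaller one.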

Long Short-Term Memory (LSTM)

On step $t$, there is a hidden state $h^{(t)}$ and a cell state $c^{(t)}$

  • Both are vectors of length $n$
  • The cell stores long-term information
  • The LSTM can erase, write and read information from the cell

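The selection of which information is erased, written, or read is controlled by three corresponding gates

  • The gates are also vectors of length $n$
  • On each timestep, each element of the gates can be open (1), closed (0), or somewhere in between
  • The gates are dynamic: their value is computed based on the current context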

We have a sequence of inputs $x^{(t)}$, and we will compute a sequence of hidden states $h^{(t)}$ and cell states $c^{(t)}$. On timestep $t$:

Forget gate - controls what is kept vs forgotten, from previous cell state

Input gate - controls what parts of the new cell content are written to cell

Output gate - controls what parts of cell are output to hidden state

New cell content - this is the new content to be written to the cell

Cell state - erase (“forget”) some content from the last cell state, and write (“input”) some new cell content

Hidden state - read (“output”) some content from the cell
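
The six quantities above can be sketched as one LSTM step in NumPy. This follows the standard LSTM formulation; the parameter names (`W_*` for the hidden-to-hidden weights, `U_*` for the input-to-hidden weights, `b_*` for the biases) and the helper `init_lstm_params` are assumptions made for illustration, not notation from these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm_params(n, d, seed=0):
    """Small random weights for hidden size n and input size d (illustrative only)."""
    rng = np.random.default_rng(seed)
    p = {}
    for name in ("f", "i", "o", "c"):
        p["W_" + name] = 0.1 * rng.standard_normal((n, n))
        p["U_" + name] = 0.1 * rng.standard_normal((n, d))
        p["b_" + name] = np.zeros(n)
    return p

def lstm_step(x, h_prev, c_prev, p):
    """One timestep: returns the new hidden state and the new cell state."""
    def pre(name):
        return p["W_" + name] @ h_prev + p["U_" + name] @ x + p["b_" + name]
    f = sigmoid(pre("f"))            # forget gate: what is kept from c_prev
    i = sigmoid(pre("i"))            # input gate: what new content is written
    o = sigmoid(pre("o"))            # output gate: what is read out to h
    c_tilde = np.tanh(pre("c"))      # new cell content
    c = f * c_prev + i * c_tilde     # erase some old content, write some new
    h = o * np.tanh(c)               # read some content from the cell
    return h, c

n, d = 3, 2
params = init_lstm_params(n, d)
h, c = np.zeros(n), np.zeros(n)
for x in np.ones((5, d)):            # a toy sequence of 5 inputs
    h, c = lstm_step(x, h, c, params)
```

If the forget gate is fully open and the input gate fully closed, the cell state is copied unchanged from one step to the next, which is how the LSTM can preserve information over many timesteps.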


Gated Recurrent Units (GRU)

Proposed as a simpler alternative to the LSTM

On each timestep $t$ we have input $x^{(t)}$ and hidden state $h^{(t)}$ (no cell state)

Update gate - controls what parts of hidden state are updated vs preserved

Reset gate - controls what parts of previous hidden state are used to compute new content

New hidden state content - the reset gate selects useful parts of the previous hidden state. Use this and the current input to compute new hidden content.

Hidden state - update gate simultaneously controls what is kept from previous hidden state, and what is updated to new hidden state content
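
A GRU step can be sketched in NumPy in the same way. This uses the convention in which the update gate $u$ weights the new content, i.e. $h^{(t)} = (1-u) \circ h^{(t-1)} + u \circ \tilde{h}^{(t)}$; some references swap the roles of $u$ and $1-u$. The parameter names and the helper `init_gru_params` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_gru_params(n, d, seed=0):
    """Small random weights for hidden size n and input size d (illustrative only)."""
    rng = np.random.default_rng(seed)
    p = {}
    for name in ("u", "r", "h"):
        p["W_" + name] = 0.1 * rng.standard_normal((n, n))
        p["U_" + name] = 0.1 * rng.standard_normal((n, d))
        p["b_" + name] = np.zeros(n)
    return p

def gru_step(x, h_prev, p):
    """One timestep: returns the new hidden state (there is no cell state)."""
    u = sigmoid(p["W_u"] @ h_prev + p["U_u"] @ x + p["b_u"])  # update gate
    r = sigmoid(p["W_r"] @ h_prev + p["U_r"] @ x + p["b_r"])  # reset gate
    # New content is computed from the reset-gated previous state and the input.
    h_tilde = np.tanh(p["W_h"] @ (r * h_prev) + p["U_h"] @ x + p["b_h"])
    # The update gate trades off keeping h_prev against taking h_tilde.
    return (1.0 - u) * h_prev + u * h_tilde

n, d = 3, 2
params = init_gru_params(n, d)
h = np.zeros(n)
for x in np.ones((5, d)):  # a toy sequence of 5 inputs
    h = gru_step(x, h, params)
```

With the update gate closed ($u = 0$) the hidden state is copied forward unchanged, which plays the same role as the LSTM's open forget gate.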


LSTM is a good default choice.

Rule of thumb: start with LSTM, but switch to GRU if you want something more efficient

Vanishing/exploding gradient

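Vanishing/exploding gradients are not just an RNN problem. They can affect all neural architectures (including feed-forward and convolutional ones), especially deep ones. A common solution is to add more direct connections, which let the gradient flow.

e.g.: Residual connections, “ResNet”. Also known as skip-connections. The identity connection preserves information by default.

e.g.: Dense connections, “DenseNet”. Each layer is directly connected to all later layers.

e.g.: Highway connections, “HighwayNet”. Similar to residual connections, but a dynamic gate controls the balance between the identity connection and the transformation layer.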
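
As a rough illustration of the residual and highway ideas, here is a minimal NumPy sketch for a single fully connected layer. The function names and the use of a `tanh` layer are assumptions for the example; real ResNet and Highway blocks use convolutions, normalization, and other details not shown here.

```python
import numpy as np

def layer(x, w, b):
    """A plain transformation layer F(x)."""
    return np.tanh(w @ x + b)

def residual_block(x, w, b):
    """Residual connection: the input is added back to the layer output."""
    return x + layer(x, w, b)

def highway_block(x, w, b, w_gate, b_gate):
    """Highway connection: a dynamic gate mixes the layer output with the identity."""
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ x + b_gate)))
    return gate * layer(x, w, b) + (1.0 - gate) * x

x = np.array([1.0, -2.0, 0.5])
zeros_w, zeros_b = np.zeros((3, 3)), np.zeros(3)
res_out = residual_block(x, zeros_w, zeros_b)  # F(x) = 0 here, so the block is the identity
hw_out = highway_block(x, zeros_w, zeros_b, zeros_w, np.full(3, -100.0))  # gate closed
```

In both cases the input can pass through untouched, so the gradient has a direct path back to earlier layers.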

Bidirectional RNNs

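Motivation (task: sentiment classification) - in a unidirectional RNN, the hidden state for a word only contains information about its left context, although the right context can also change how the word should be interpreted.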


Note: bidirectional RNNs are only applicable if you have access to the entire input sentence. They are not applicable to language modeling, where only the left context is available.

BERT is built on bidirectionality.
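
A bidirectional RNN can be sketched as two independent RNNs, one reading left to right and one reading right to left, whose hidden states are concatenated at each position. The code below is a minimal NumPy illustration with a plain `tanh` RNN; the function names and the random parameters are assumptions for the example.

```python
import numpy as np

def run_rnn(xs, w_h, w_x, b):
    """Run a tanh RNN over xs (shape T x d); return all hidden states (T x n)."""
    h = np.zeros(w_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(w_h @ h + w_x @ x + b)
        states.append(h)
    return np.stack(states)

def bidirectional_rnn(xs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states at each position (T x 2n)."""
    fwd = run_rnn(xs, *fwd_params)
    bwd = run_rnn(xs[::-1], *bwd_params)[::-1]  # reverse again to re-align with xs
    return np.concatenate([fwd, bwd], axis=1)

rng = np.random.default_rng(0)
T, d, n = 4, 2, 3

def make_params():
    # Separate weights for each direction.
    return (0.5 * rng.standard_normal((n, n)),
            0.5 * rng.standard_normal((n, d)),
            np.zeros(n))

fwd_params, bwd_params = make_params(), make_params()
xs = rng.standard_normal((T, d))
out = bidirectional_rnn(xs, fwd_params, bwd_params)  # shape (T, 2n)
```

The first `n` entries of each row depend only on the left context, while the last `n` entries depend only on the right context, so the whole input must be available before the backward pass can run.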

Multi-layer RNNs = stacked RNNs

High-performing RNNs are often multi-layer (but aren’t as deep as CNNs)
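
Stacking can be sketched by feeding the hidden states of one RNN layer as the inputs of the next. The following NumPy example is a minimal illustration with plain `tanh` layers; the function names, layer count, and random weights are assumptions for the example.

```python
import numpy as np

def rnn_layer(xs, w_h, w_x, b):
    """Run one tanh RNN layer over xs (shape T x d_in); return states (T x n)."""
    h = np.zeros(w_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(w_h @ h + w_x @ x + b)
        states.append(h)
    return np.stack(states)

def stacked_rnn(xs, layers):
    """The hidden states of each layer become the inputs of the next layer."""
    for w_h, w_x, b in layers:
        xs = rnn_layer(xs, w_h, w_x, b)
    return xs

rng = np.random.default_rng(0)
T, d, n = 5, 2, 3
# The first layer maps d -> n; the later layers map n -> n.
layers = [(0.5 * rng.standard_normal((n, n)),
           0.5 * rng.standard_normal((n, d_in)),
           np.zeros(n)) for d_in in (d, n, n)]
top_states = stacked_rnn(rng.standard_normal((T, d)), layers)  # shape (T, n)
```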