Stanford CS224n Natural Language Processing Course 6 - ongoing

Course 6 - Language Models and RNNs

Language Modeling

Language modeling is the task of predicting what word comes next.

Equivalently, a language model assigns a probability to a piece of text.
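
The two views are connected by the chain rule of probability. As a short worked equation (the $x^{(t)}$ notation for the t-th word is an assumption made here for illustration), the probability of a text of T words factorizes into a product of next-word predictions:

$$P(x^{(1)}, \dots, x^{(T)}) = \prod_{t=1}^{T} P(x^{(t)} \mid x^{(t-1)}, \dots, x^{(1)})$$

Each factor on the right is exactly what a next-word predictor outputs, so a model that predicts the next word also scores whole texts.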

n-gram Language Models


An n-gram is a chunk of n consecutive words.


Collect statistics about how frequent different n-grams are, and use these to predict the next word.
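
A minimal sketch of this counting approach, assuming a made-up toy corpus and a 4-gram model. The function names `train_ngram` and `next_word_probs` are hypothetical and chosen only for this illustration:

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n):
    """Count how often each word follows each (n-1)-word context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def next_word_probs(counts, context):
    """P(w | context) = count(context, w) / count(context)."""
    followers = counts[tuple(context)]
    total = sum(followers.values())
    # An unseen context has no followers, so this returns an empty dict.
    return {w: c / total for w, c in followers.items()}

# Toy corpus, invented for illustration only.
corpus = ("the students opened their books . "
          "the students opened their laptops . "
          "the students opened their books .").split()

counts = train_ngram(corpus, 4)
probs = next_word_probs(counts, ["students", "opened", "their"])
print(probs)  # "books" gets 2/3 and "laptops" gets 1/3
```

Any word that never followed the context gets no entry at all, which is the sparsity issue that n-gram models run into.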

Sparsity Problem 1: what if “students opened their w” never occurred in the data? Then w gets probability 0.

Solution (smoothing): add a small $\delta$ to the count for every w in the vocabulary.

Sparsity Problem 2: what if “students opened their” never occurred in the data? Then the probability cannot be computed for any w.

Solution (backoff): just condition on “opened their” instead.
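
Both fixes can be sketched in a few lines. This is an illustration under stated assumptions rather than a reference implementation: the toy corpus, the value `delta=0.1`, and the helper names `count_ngrams`, `smoothed_prob`, and `backoff_context` are all made up here, and the backoff shown is the simple "shorten the context" variant described above.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every n-gram (as a tuple of words) in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_prob(word, context, tokens, vocab, delta=0.1):
    """Add-delta smoothing: (count(context, w) + delta) / (count(context) + delta * |V|)."""
    context = tuple(context)
    counts = count_ngrams(tokens, len(context) + 1)
    context_total = sum(counts[context + (w,)] for w in vocab)
    return (counts[context + (word,)] + delta) / (context_total + delta * len(vocab))

def backoff_context(context, tokens):
    """Drop the earliest word until the context has been seen with some following word."""
    context = tuple(context)
    while context:
        counts = count_ngrams(tokens, len(context) + 1)
        if any(ngram[:-1] == context for ngram in counts):
            return context
        context = context[1:]
    return context  # empty tuple: nothing matched, fall back to unigram counts

# Toy corpus, invented for illustration only.
corpus = "the students opened their books . the teachers opened their laptops .".split()
vocab = sorted(set(corpus))

# Problem 1: "students opened their laptops" never occurs, yet gets nonzero probability.
p_unseen = smoothed_prob("laptops", ["students", "opened", "their"], corpus, vocab)
p_seen = smoothed_prob("books", ["students", "opened", "their"], corpus, vocab)

# Problem 2: "pupils opened their" never occurs, so condition on "opened their" instead.
shorter = backoff_context(["pupils", "opened", "their"], corpus)
print(p_unseen, p_seen, shorter)
```

The smoothed probabilities still sum to 1 over the vocabulary, because the same delta is added to every word's count and delta times the vocabulary size is added to the denominator.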

How to build a neural Language Model

A fixed-window neural Language Model

Recurrent Neural Networks (RNN)


Core idea: Apply the same weights W repeatedly, once per time step.



Advantages:

  • Can process input of any length
  • Computation for step t can, in theory, use information from many steps back
  • Model size doesn’t increase for longer input
  • The same weights are applied at every step


Disadvantages:

  • Recurrent computation is slow, because each step depends on the previous one
  • In practice, it is difficult to access information from many steps back
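
The weight sharing can be sketched in plain Python. Assumptions made for this illustration: a tanh nonlinearity, the update $h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$, random toy weights, and the names `W_h`, `W_x`, `b`, `matvec`, and `rnn_forward`, none of which come from the notes above.

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_forward(inputs, W_h, W_x, b, h0):
    """Run h_t = tanh(W_h h_{t-1} + W_x x_t + b) with the same weights at every step."""
    h, states = h0, []
    for x in inputs:  # sequential: step t needs the hidden state from step t-1
        pre = [a + c + d for a, c, d in zip(matvec(W_h, h), matvec(W_x, x), b)]
        h = [math.tanh(p) for p in pre]
        states.append(h)
    return states

random.seed(0)
hidden, embed = 4, 3  # toy sizes, chosen arbitrarily

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def rand_inputs(steps):
    return [[random.uniform(-1, 1) for _ in range(embed)] for _ in range(steps)]

W_h, W_x = rand_matrix(hidden, hidden), rand_matrix(hidden, embed)
b, h0 = [0.0] * hidden, [0.0] * hidden

short_seq = rand_inputs(2)
long_seq = short_seq + rand_inputs(8)

states_short = rnn_forward(short_seq, W_h, W_x, b, h0)
states_long = rnn_forward(long_seq, W_h, W_x, b, h0)

# The parameter count depends only on the layer sizes, not on the input length.
n_params = hidden * hidden + hidden * embed + hidden
print(len(states_short), len(states_long), n_params)
```

The same three parameter arrays handle both the 2-step and the 10-step input, which is why the model size stays fixed. The explicit loop also shows why computation is slow: the steps cannot run in parallel.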

Training an RNN Language Model

  • Get a big corpus of text, which is a sequence of words $x^{(1)}, \cdots, x^{(T)}$

  • Feed it into the RNN-LM and compute the predicted distribution over the next word at every step t

  • The loss function on step t is the cross-entropy between the predicted distribution and the true next word $x^{(t+1)}$

  • Average these per-step losses to get the overall loss for the entire training set

  • However, computing the loss and gradients across the entire corpus is too expensive. In practice, consider $x^{(1)}, \cdots, x^{(T)}$ as a sentence (or a document)

  • Then use SGD: compute the loss $J(\theta)$ for a sentence, compute the gradients, and update the weights. Repeat.
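
The loop above can be sketched end to end on a toy problem. Everything concrete here is an assumption for illustration: the vocabulary size `V`, hidden size `H`, the word-id sentences, the learning rate, and the flat parameter list `theta`. To keep the sketch short, gradients are estimated by finite differences instead of backpropagation; a real implementation would use backpropagation through time.

```python
import math
import random

V, H = 5, 3  # toy vocabulary and hidden sizes (illustrative)

def unpack(theta):
    """Split the flat parameter list into W_x (H x V), W_h (H x H), U (V x H)."""
    it = iter(theta)
    def take(rows, cols):
        return [[next(it) for _ in range(cols)] for _ in range(rows)]
    return take(H, V), take(H, H), take(V, H)

def sentence_loss(theta, sentence):
    """Average over steps of the cross-entropy -log P(true next word | words so far)."""
    W_x, W_h, U = unpack(theta)
    h, total = [0.0] * H, 0.0
    for x, target in zip(sentence[:-1], sentence[1:]):
        # One-hot input: W_x times a one-hot vector is just column x of W_x.
        h = [math.tanh(W_x[i][x] + sum(W_h[i][j] * h[j] for j in range(H)))
             for i in range(H)]
        logits = [sum(U[k][i] * h[i] for i in range(H)) for k in range(V)]
        log_z = math.log(sum(math.exp(l) for l in logits))
        total += log_z - logits[target]  # cross-entropy on this step
    return total / (len(sentence) - 1)

def numerical_grad(theta, sentence, eps=1e-5):
    """Central finite differences (a stand-in for backpropagation)."""
    grad = []
    for i in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((sentence_loss(plus, sentence) - sentence_loss(minus, sentence)) / (2 * eps))
    return grad

def corpus_loss(theta, corpus):
    return sum(sentence_loss(theta, s) for s in corpus) / len(corpus)

random.seed(0)
theta = [random.uniform(-0.5, 0.5) for _ in range(H * V + H * H + V * H)]
corpus = [[0, 1, 2, 3, 4], [0, 1, 2, 4], [0, 3, 2, 3, 4]]  # sentences as word ids

initial = corpus_loss(theta, corpus)
lr = 0.5
for epoch in range(40):
    random.shuffle(corpus)
    for sentence in corpus:  # SGD: one update per sentence
        grad = numerical_grad(theta, sentence)
        theta = [p - lr * g for p, g in zip(theta, grad)]
final = corpus_loss(theta, corpus)
print(initial, final)
```

Each update only looks at one sentence, so its cost does not grow with the size of the corpus. That is the reason for using SGD here.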

Backpropagation for RNNs

Question: what is the derivative of $J^{(t)}(\theta)$ w.r.t. the repeated weight matrix?
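
By the multivariable chain rule, the gradient with respect to a repeated weight is the sum of the gradients with respect to each time it appears: $\frac{\partial J^{(t)}}{\partial W_h} = \sum_{i=1}^{t} \left.\frac{\partial J^{(t)}}{\partial W_h}\right|_{(i)}$, where $W_h$ is assumed here as the name of the repeated matrix. The sketch below checks this numerically on a scalar RNN. The scalar setting, the update $h_i = \tanh(w h_{i-1} + x_i)$, the loss $J = h_T$, and all names are simplifying assumptions made for illustration.

```python
import math

def forward(w, xs, h0):
    """Scalar RNN: h_i = tanh(w * h_{i-1} + x_i); returns [h_0, ..., h_T]."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + x))
    return hs

def loss(w, xs, h0):
    """Toy loss J = h_T, the final hidden state."""
    return forward(w, xs, h0)[-1]

def grad_per_use(w, xs, h0):
    """Gradient of J through each separate use of w, from the last step back to the first."""
    hs = forward(w, xs, h0)
    delta = 1.0 - hs[-1] ** 2  # dJ/d(pre_T), since dJ/dh_T = 1 and tanh' = 1 - tanh^2
    parts = []
    for i in range(len(xs), 0, -1):
        parts.append(delta * hs[i - 1])      # contribution of the i-th use of w
        delta *= w * (1.0 - hs[i - 1] ** 2)  # propagate back to pre_{i-1}
    return parts

w, xs, h0 = 0.8, [0.5, -0.3, 0.9, 0.1], 0.2
parts = grad_per_use(w, xs, h0)
analytic = sum(parts)  # sum over every appearance of w

eps = 1e-6
numeric = (loss(w + eps, xs, h0) - loss(w - eps, xs, h0)) / (2 * eps)
print(analytic, numeric)
```

The summed per-step contributions should agree with the finite-difference estimate up to numerical error. Accumulating these contributions backwards over time steps is what backpropagation through time does.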