# Course 6 - Language Models and RNNs

## Language Modeling

**Language Modeling** is the task of predicting what word comes next.

Equivalently, a language model **assigns a probability to a piece of text**.
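These two views are connected by the chain rule: the probability of a text decomposes into a product of next-word predictions:

$$P\left(x^{(1)}, \ldots, x^{(T)}\right) = \prod_{t=1}^{T} P\left(x^{(t)} \mid x^{(t-1)}, \ldots, x^{(1)}\right)$$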

### n-gram Language Models

#### Definition

An **n-gram** is a chunk of $n$ consecutive words (unigrams, bigrams, trigrams, 4-grams, ... for $n = 1, 2, 3, 4, \ldots$).

#### Idea

Collect statistics about how frequent different n-grams are, and use these to predict the next word.
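Concretely, an n-gram model makes a Markov assumption (only the preceding $n-1$ words matter) and estimates the next-word probability by counting:

$$P\left(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(t-n+2)}\right) \approx \frac{\operatorname{count}\left(x^{(t+1)}, x^{(t)}, \ldots, x^{(t-n+2)}\right)}{\operatorname{count}\left(x^{(t)}, \ldots, x^{(t-n+2)}\right)}$$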

#### Sparsity Problem 1: what if “students opened their *w*” never occurred in the data?

**Solution (smoothing):** add a small $\delta$ to the count for every word *w* in the vocabulary.

#### Sparsity Problem 2: what if “students opened their” never occurred in the data?

**Solution (backoff):** just condition on “opened their” instead.
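A minimal sketch of a count-based n-gram model with both fixes, add-$\delta$ smoothing and backoff (all function and variable names are illustrative, not from the course):

```python
from collections import Counter

def train_ngram(tokens, n):
    """Count all n-grams and their (n-1)-word prefixes in a token list."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    prefixes = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return ngrams, prefixes

def prob(word, prefix, ngrams, prefixes, vocab_size, delta=0.01):
    """P(word | prefix) with add-delta smoothing (fixes Sparsity Problem 1)."""
    return (ngrams[prefix + (word,)] + delta) / (prefixes[prefix] + delta * vocab_size)

def prob_backoff(word, prefix, models, vocab_size, delta=0.01):
    """Back off to shorter prefixes when the full prefix was never seen
    (fixes Sparsity Problem 2). `models` maps n to (ngrams, prefixes)."""
    while prefix:
        ngrams, prefixes = models[len(prefix) + 1]
        if prefixes[prefix] > 0:                    # prefix was observed: use it
            return prob(word, prefix, ngrams, prefixes, vocab_size, delta)
        prefix = prefix[1:]                         # drop the earliest word
    unigrams, _ = models[1]                         # smoothed unigram fallback
    return (unigrams[(word,)] + delta) / (sum(unigrams.values()) + delta * vocab_size)

tokens = "the students opened their books as the students opened their minds".split()
models = {n: train_ngram(tokens, n) for n in (1, 2, 3, 4)}
print(prob_backoff("books", ("students", "opened", "their"), models, len(set(tokens))))
```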

## How to build a *neural* Language Model

### A fixed-window neural Language Model

Concatenate the word embeddings of a fixed window of previous words, pass the result through a hidden layer, and apply a softmax over the vocabulary to predict the next word. This removes the sparsity problem, but the window is fixed and small, and enlarging it enlarges the weight matrix.
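A minimal PyTorch sketch of such a model (the class name, layer sizes, and `window` parameter are illustrative assumptions, not from the course):

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Predict the next word from a fixed window of the previous `window` words."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, window):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context):            # context: (batch, window) word ids
        e = self.embed(context)            # (batch, window, embed_dim)
        e = e.flatten(start_dim=1)         # concatenate the window's embeddings
        h = torch.tanh(self.hidden(e))     # hidden layer
        return self.out(h)                 # logits over the vocabulary
```

For example, `FixedWindowLM(10_000, 64, 128, window=4)` predicts the next word from the previous 4 words only; any earlier context is discarded.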

### Recurrent Neural Networks (RNNs)

**Core idea**: apply the same weight matrix $W$ repeatedly at every time step.
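Concretely, in the standard RNN-LM formulation, the hidden state at step $t$ is computed from the previous hidden state and the current word embedding, using the same weights at every step:

$$h^{(t)} = \sigma\left(W_h h^{(t-1)} + W_e e^{(t)} + b_1\right), \qquad \hat{y}^{(t)} = \operatorname{softmax}\left(U h^{(t)} + b_2\right)$$

where $e^{(t)}$ is the embedding of the input word $x^{(t)}$ and $\hat{y}^{(t)}$ is the predicted distribution over the next word.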

**Advantages**

- Can process input of any length
- Can, in theory, use information from many steps back
- Model size doesn’t increase for longer input
- The same weights are applied on every step (see the sketch after these lists)

**Disadvantages**

- Recurrent computation is slow (each step depends on the previous one, so steps can’t be parallelized)
- In practice, it is difficult to access information from many steps back
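A minimal sketch of the recurrence in plain NumPy, showing that the same `W_h` and `W_e` are reused at every step and that any sequence length works (array names follow the equations above and are illustrative):

```python
import numpy as np

def rnn_forward(embeddings, W_h, W_e, b):
    """Run an RNN over a sequence of word embeddings.
    embeddings: (T, d_in); W_h: (d_h, d_h); W_e: (d_h, d_in); b: (d_h,)"""
    h = np.zeros(W_h.shape[0])
    states = []
    for e_t in embeddings:                      # one step per input word
        h = np.tanh(W_h @ h + W_e @ e_t + b)    # same weights at every step
        states.append(h)
    return np.stack(states)                     # (T, d_h)

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 4, 8
H = rnn_forward(rng.normal(size=(T, d_in)),
                rng.normal(size=(d_h, d_h)) * 0.1,
                rng.normal(size=(d_h, d_in)) * 0.1,
                np.zeros(d_h))
print(H.shape)  # (5, 8): model size is independent of sequence length
```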

#### Training an RNN Language Model

- Get a **big corpus of text**: a sequence of words $x^{(1)}, \cdots, x^{(T)}$.
- Feed it into the RNN-LM and compute the output distribution $\hat{y}^{(t)}$ for every step $t$.
- The loss function on step $t$ is the **cross-entropy** between $\hat{y}^{(t)}$ and the true next word $x^{(t+1)}$:
  $$J^{(t)}(\theta) = -\log \hat{y}^{(t)}_{x^{(t+1)}}$$
- Average this to get the **overall loss** for the entire training set:
  $$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)$$
- However, computing the loss and gradients across the **entire corpus** is too expensive. In practice, consider $x^{(1)}, \cdots, x^{(T)}$ to be a **sentence** (or a **document**).
- Instead, use **SGD**: compute the loss $J(\theta)$ for a sentence, compute gradients, update the weights, and repeat.
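A minimal sketch of this loop in PyTorch (the `RNNLM` class, hyperparameters, and the random stand-in data are illustrative assumptions, not from the course):

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):                   # x: (batch, T) word ids
        h, _ = self.rnn(self.embed(x))      # h: (batch, T, hidden_dim)
        return self.out(h)                  # logits: (batch, T, vocab_size)

vocab_size = 10_000
model = RNNLM(vocab_size, embed_dim=64, hidden_dim=128)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()             # cross-entropy per step, averaged

for step in range(100):                     # one batch of sentences per SGD step
    batch = torch.randint(0, vocab_size, (32, 20))  # stand-in for real sentences
    inputs, targets = batch[:, :-1], batch[:, 1:]   # target is the next word
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                         # backpropagation through time
    optimizer.step()
```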

#### Backpropagation for RNNs

**Question**: what is the derivative of $J^{(t)}(\theta)$ w.r.t. the **repeated** weight matrix $W_h$?

**Answer**: the gradient w.r.t. a repeated weight is the sum of the gradients w.r.t. each time it appears:

$$\frac{\partial J^{(t)}}{\partial W_h} = \sum_{i=1}^{t} \left. \frac{\partial J^{(t)}}{\partial W_h} \right|_{(i)}$$

This is computed by accumulating the gradient as you backpropagate backwards through the time steps, an algorithm known as **backpropagation through time**.