Stanford CS224n Natural Language Processing Course 8

Course 8 - Translation, Seq2Seq, Attention

Machine Translation

Machine translation (MT) is the task of translating a sentence from the source language to the target language.

Statistical Machine Translation

Core idea: learn a probabilistic model from data.

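We want to find the best English sentence $y$, given French sentence $x$: $\arg\max_y P(y|x)$. By Bayes' rule this equals $\arg\max_y P(x|y)P(y)$, which splits the problem into two components that are learnt separately.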

Translation model ($P(x|y)$) - Models how words and phrases should be translated. Learnt from parallel data.

Note: parallel data consists of pairs of human-translated French/English sentences.

Language model ($P(y)$) - Models how to write good English. Learnt from monolingual data.

Learning alignment for SMT

Alignment is the correspondence between particular words in the translated sentence pair.

Alignment can be many-to-one & one-to-many & many-to-many

We learn $P(x, a|y)$, where $a$ is the alignment, as a combination of many factors.

Decoding for SMT

Use a heuristic search algorithm to search for the best translation, discarding hypotheses that are too low-probability. This process is called decoding.

Neural Machine Translation

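Sequence-to-sequence (seq2seq) involves 2 RNNs: an encoder, which produces an encoding of the source sentence, and a decoder, which generates the target sentence conditioned on that encoding. The seq2seq model is a conditional language model: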

  • conditional - its predictions are conditioned on the source sentence $x$


Seq2seq is useful for other tasks besides MT:

  • summarization
  • dialogue
  • parsing
  • code generation (natural language -> Python code)

NMT directly calculates $P(y|x)$
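Written out, and assuming the target sentence has $T$ words $y_1, \dots, y_T$ (notation introduced here for illustration, not taken from the notes), a conditional language model factorizes this probability one target word at a time:

$$P(y|x) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, x)$$

Each factor is the decoder's probability of the next target word, given the target words so far and the source sentence $x$.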

Question: How to train?

Answer: Get a big parallel corpus…
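The notes do not spell out the objective, so the following is only a sketch of a commonly used one: the average negative log probability that the decoder assigns to the correct target words. The name `sequence_loss` and its input format are assumptions made for illustration.

```python
import math

def sequence_loss(step_probs):
    """Average negative log probability of the reference target words.

    step_probs[t] is assumed to be the probability the decoder assigned
    to the correct target word at step t (a hypothetical input format).
    """
    return -sum(math.log(p) for p in step_probs) / len(step_probs)

# A confident, correct decoder has a lower loss than an unsure one.
confident = sequence_loss([0.9, 0.8, 0.95])
unsure = sequence_loss([0.3, 0.2, 0.4])
```

In a real system this loss would be minimized end-to-end with backpropagation through both the decoder and the encoder.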

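Greedy decoding (taking the most probable word on each step) has a major problem: it has no way to undo decisions. Exhaustive search over all possible sequences is far too expensive, so instead we use beam search decoding.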

Core idea: On each step of decoder, keep track of the k most probable partial translations (hypotheses)

  • k is the beam size
  • Scores are all negative (they are sums of log probabilities), and a higher score is better
  • We search for high-scoring hypotheses, tracking top k on each step

Beam search is not guaranteed to find the optimal solution, but it is much more efficient than exhaustive search.
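The core idea can be sketched in a few lines of Python. This is a minimal illustration, not a production decoder: `next_log_probs` is a hypothetical stand-in for the decoder RNN, the `TOY` table is made-up data, and the `<START>`/`<END>` token names are assumptions.

```python
import math

def beam_search(next_log_probs, k, max_len, start="<START>", end="<END>"):
    """Return completed hypotheses as (score, tokens) pairs, best first.

    next_log_probs(prefix) maps a tuple of tokens to a dict
    {next_token: log probability}; it stands in for the decoder.
    """
    beam = [(0.0, (start,))]  # (sum of log probabilities, tokens so far)
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beam:
            for token, log_p in next_log_probs(tokens).items():
                candidates.append((score + log_p, tokens + (token,)))
        # Keep only the k highest-scoring extensions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, tokens in candidates[:k]:
            if tokens[-1] == end:
                completed.append((score, tokens))  # set finished hypotheses aside
            else:
                beam.append((score, tokens))
        if not beam:
            break
    completed.sort(key=lambda c: c[0], reverse=True)
    return completed

# Made-up next-word probabilities; any prefix not listed ends the sentence.
TOY = {
    ("<START>",): {"he": 0.6, "I": 0.4},
    ("<START>", "he"): {"hit": 0.5, "struck": 0.5},
    ("<START>", "I"): {"was": 0.9, "am": 0.1},
}

def toy_next_log_probs(prefix):
    probs = TOY.get(prefix, {"<END>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}

greedy = beam_search(toy_next_log_probs, k=1, max_len=5)  # beam size 1 is greedy decoding
beam = beam_search(toy_next_log_probs, k=2, max_len=5)
```

With `k=1` the search commits to "he" on the first step and cannot recover. With `k=2` it keeps "I" alive and ends up with the higher-probability sequence, which shows why tracking several hypotheses helps.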

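In greedy decoding, we usually decode until the model produces an <END> token. For example: <START> he hit me with a pie <END>. In beam search, different hypotheses may produce <END> on different steps, so completed hypotheses are set aside while the search continues with the others.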

Problem: how do we select the top hypothesis from the completed ones? Longer hypotheses have lower scores, because every additional word adds another negative log probability.

Fix: Normalize by length.
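A small numeric sketch of the fix, using made-up per-token log probabilities. The helper name `normalized_score` is illustrative; it simply divides the summed score by the number of tokens.

```python
def normalized_score(log_probs):
    """Average per-token log probability of a hypothesis."""
    return sum(log_probs) / len(log_probs)

# Made-up per-token log probabilities for two completed hypotheses.
short = [-1.0, -1.2]               # 2 tokens, total -2.2
long = [-0.6, -0.7, -0.5, -0.6]    # 4 tokens, total about -2.4

raw_best = max([short, long], key=sum)                # favours the short hypothesis
norm_best = max([short, long], key=normalized_score)  # favours the long hypothesis
```

The long hypothesis is more probable per token, but it only wins once the score is divided by its length.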

Advantages of NMT

  • Better performance
    • More fluent
    • Better use of context
    • Better use of phrase similarities
  • A single neural network to be optimized end-to-end
    • No subcomponents to be individually optimized
  • Requires much less human engineering effort
    • No feature engineering

Disadvantages of NMT

Compared to SMT:

  • NMT is less interpretable
    • Hard to debug
  • NMT is difficult to control

How to evaluate?

BLEU (Bilingual Evaluation Understudy)

BLEU compares the machine-written translation to one or several human-written translations, and computes a similarity score based on:

  • n-gram precision
  • Plus a penalty for too-short system translations
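The two ingredients can be sketched as follows. This is a simplified sentence-level illustration with hypothetical function names. It assumes the usual formulation (clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty), whereas the real metric is computed over a whole corpus.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Share of candidate n-grams found in a reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def brevity_penalty(candidate, references):
    """Penalty below 1 when the candidate is shorter than the closest reference."""
    c = len(candidate)
    if c == 0:
        return 0.0
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c >= r else math.exp(1 - r / c)

def bleu(candidate, references, max_n=4):
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)

reference = "he hit me with a pie".split()
score = bleu("he hit me with a tart".split(), [reference])
```

Clipping is what stops a degenerate output such as "the the the the" from getting perfect unigram precision against a reference that contains "the" only twice.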

Difficulties remain

Out-of-vocabulary words

Domain mismatch between train and test data

Maintaining context over longer text

Low-resource language pairs

NMT picks up biases in training data

Uninterpretable systems do strange things.


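Bottleneck problem: the encoder has to compress all information about the source sentence into a single vector (its final encoding), which limits what reaches the decoder. Attention addresses this by letting the decoder focus directly on particular parts of the source sentence on each step.

Definition - Given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.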
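A minimal pure-Python sketch of that definition, assuming the weights come from a softmax over dot-product scores between the query and each value (one common choice). The function names and the toy vectors are illustrative only.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(query, values):
    """Return (weights, output): output is the weighted sum of the values."""
    scores = [sum(q * v for q, v in zip(query, value)) for value in values]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * value[i] for w, value in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy example: the query is most similar to the first value.
values = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.0]
weights, output = dot_product_attention(query, values)
```

In seq2seq with attention, the values would be the encoder hidden states and the query the current decoder hidden state, so the output is a summary of the source sentence tailored to the current decoding step.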


Attention variants differ in how the attention scores are computed:

  • Basic dot-product attention
  • Multiplicative attention
  • Additive attention
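As a sketch, the three variants are usually written as follows. The symbols are assumptions introduced here, not taken from the notes: values $h_1, \dots, h_N$, query $s$, score $e_i$ for value $h_i$, and learned parameters $W$, $W_1$, $W_2$ and $v$.

$$e_i = s^\top h_i \quad \text{(basic dot-product; requires } s \text{ and } h_i \text{ to have the same dimension)}$$

$$e_i = s^\top W h_i \quad \text{(multiplicative)}$$

$$e_i = v^\top \tanh(W_1 h_i + W_2 s) \quad \text{(additive)}$$

In all three cases the scores are turned into weights with a softmax, $\alpha = \mathrm{softmax}(e)$, and the attention output is the weighted sum $\sum_{i=1}^{N} \alpha_i h_i$.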