# Course 8 - Translation, Seq2Seq, Attention

## Machine Translation

Machine translation (MT) is the task of translating a sentence $x$ from the source language into a sentence $y$ in the target language.

### Statistical Machine Translation

Core idea: learn a probabilistic model from data.

We want to find the best English sentence $y$ given the French sentence $x$:

$$\operatorname{argmax}_y P(y|x) = \operatorname{argmax}_y P(x|y)\,P(y)$$

By Bayes' rule, this breaks the problem into two components:

Translation model ($P(x|y)$) - models how words and phrases should be translated. Learnt from parallel data.

Note: parallel data means pairs of human-translated French/English sentences.

Language model ($P(y)$) - models how to write good English. Learnt from monolingual data.

#### Learning alignment for SMT

Alignment is the correspondence between particular words in the translated sentence pair.

Alignment can be many-to-one, one-to-many, or many-to-many.

We learn $P(x, a|y)$ as a combination of many factors.

#### Decoding for SMT

Use a heuristic search algorithm to search for the best translation, discarding hypotheses whose probability is too low. This process is called decoding.

### Neural Machine Translation

Sequence-to-sequence (seq2seq) involves two RNNs: an encoder and a decoder. The decoder is a conditional language model:

• conditional - its predictions are conditioned on the source sentence $x$
• language model - the decoder predicts the next word of the target sentence $y$

Seq2seq is useful for more than MT:

• summarization
• dialogue
• parsing
• code generation (natural language -> Python code)

NMT directly calculates $P(y|x) = P(y_1|x)\,P(y_2|y_1,x)\cdots P(y_T|y_1,\dots,y_{T-1},x)$
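The chain-rule factorization above can be sketched numerically; a minimal example with invented per-step probabilities (a real decoder's softmax would supply these):

```python
import math

# Hypothetical probabilities P(y_t | y_1..y_{t-1}, x) of the chosen
# target tokens at each step (invented numbers, for illustration only).
step_probs = [0.6, 0.5, 0.9, 0.8]

# Work in log space to avoid underflow on long sentences.
log_p = sum(math.log(p) for p in step_probs)
p = math.exp(log_p)
print(round(p, 3))  # 0.216 = 0.6 * 0.5 * 0.9 * 0.8
```

Summing log-probabilities rather than multiplying raw probabilities is also exactly the score used by the decoding algorithms below.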

Question: How to train?

Answer: Get a big parallel corpus…

Greedy decoding (taking the most probable word on each step) has no way to undo decisions, while exhaustive search over all possible output sequences is far too expensive. Instead we use beam search decoding.

Core idea: On each step of decoder, keep track of the k most probable partial translations (hypotheses)

• k is the beam size (in practice around 5 to 10)
• Scores are sums of log-probabilities, so they are all negative, and a higher score is better
• We search for high-scoring hypotheses, tracking the top k on each step

Beam search is not guaranteed to find the optimal solution, but it is much more efficient than exhaustive search.
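The procedure can be sketched over a hand-written toy distribution table standing in for a real decoder (all tokens and probabilities below are invented for illustration):

```python
import math

def beam_search(step_fn, start, end, k=2, max_len=10):
    """Keep the k most probable partial hypotheses on each step.
    step_fn(prefix_tuple) -> {token: probability}; score = sum of log-probs."""
    beams = [([start], 0.0)]  # (hypothesis, score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, p in step_fn(tuple(seq)).items():
                candidates.append((seq + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:k]:  # prune to the top k
            if seq[-1] == end:
                finished.append((seq, score))  # set completed hypotheses aside
            else:
                beams.append((seq, score))
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[1])

# Toy next-token distributions, keyed by the prefix generated so far.
table = {
    ("<s>",): {"he": 0.7, "she": 0.3},
    ("<s>", "he"): {"hit": 0.6, "</s>": 0.4},
    ("<s>", "she"): {"</s>": 1.0},
    ("<s>", "he", "hit"): {"me": 1.0},
    ("<s>", "he", "hit", "me"): {"</s>": 1.0},
}

best, score = beam_search(lambda prefix: table[prefix], "<s>", "</s>", k=2)
print(best)  # ['<s>', 'he', 'hit', 'me', '</s>']
```

Note that with k=2 the search keeps the weaker "she" hypothesis alive for one step before it is outscored, which greedy decoding (k=1) could never reconsider.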

In greedy decoding, we usually decode until the model produces an `<END>` token. In beam search, different hypotheses may produce `<END>` on different timesteps; when a hypothesis produces `<END>`, it is complete, so we place it aside and continue exploring other hypotheses. For example: `<START>` he hit me with a pie `<END>`

Problem: How to select the hypothesis with the highest score? Longer hypotheses have lower scores, since each additional token adds a negative log-probability term.

Fix: Normalize by length, i.e. score each hypothesis by its per-token average log-probability $\frac{1}{t}\sum_{i=1}^{t}\log P(y_i|y_1,\dots,y_{i-1},x)$.
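A small numerical sketch of why the fix is needed (the log-probabilities are invented for illustration):

```python
import math

# Two hypotheses of different lengths, with made-up per-token probabilities.
short_hyp = [math.log(0.5)] * 3  # 3 tokens, each with P = 0.5
long_hyp = [math.log(0.6)] * 6   # 6 tokens, each with P = 0.6 (better per token)

raw_short, raw_long = sum(short_hyp), sum(long_hyp)
# Raw scores unfairly favour the shorter hypothesis:
print(raw_short > raw_long)  # True

# Normalize by length: average log-probability per token.
norm_short = raw_short / len(short_hyp)
norm_long = raw_long / len(long_hyp)
print(norm_long > norm_short)  # True: the per-token-better hypothesis now wins
```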

Advantages of NMT compared to SMT:

• better performance
• more fluent
• better use of context
• better use of phrase similarities
• A single neural network to be optimized end-to-end
• No subcomponents to be individually optimized
• Requires much less human engineering effort
• No feature engineering

Disadvantages of NMT compared to SMT:

• NMT is less interpretable
• Hard to debug
• NMT is difficult to control

#### How to evaluate?

BLEU (Bilingual Evaluation Understudy)

BLEU compares the machine-written translation to one or several human-written translations, and computes a similarity score based on:

• n-gram precision (usually for 1-, 2-, 3- and 4-grams)
• Plus a penalty for too-short system translations
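The two ingredients above can be sketched as a simplified, single-reference, sentence-level BLEU (real BLEU is computed at the corpus level, often with smoothing; this is only an illustration of the idea):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum((cand & ref).values())  # counts clipped by the reference
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

ref = "he hit me with a pie".split()
print(bleu("he hit me with a pie".split(), ref))  # 1.0 for an exact match
```

A truncated candidate like "he hit me" still gets perfect n-gram precision here, and only the brevity penalty stops it from scoring 1.0 - which is exactly why the penalty exists.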

#### Difficulties remain

Out-of-vocabulary words

Domain mismatch between train and test data

Maintaining context over longer text

Low-resource language pairs

NMT picks up biases in training data

Uninterpretable systems do strange things.

#### Attention

Bottleneck problem: the encoder must compress all information about the source sentence into a single fixed-size vector. Attention provides a solution: on each step of the decoder, use direct connections to the encoder to focus on particular parts of the source sentence.

Definition: given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.

Variants:

• Basic dot-product attention: $e_i = s^T h_i$ (assumes the query and values have the same dimensionality)
• Multiplicative attention: $e_i = s^T W h_i$, where $W$ is a learned weight matrix
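Both variants can be sketched with NumPy; the encoder states, decoder state, and weight matrix below are random placeholders for trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Encoder hidden states h_1..h_4 (the values) and one decoder state s
# (the query). Sizes are illustrative: 4 source positions, hidden size 3.
H = rng.standard_normal((4, 3))
s = rng.standard_normal(3)

# Basic dot-product attention: e_i = s^T h_i (needs matching dimensions).
e_dot = H @ s

# Multiplicative attention: e_i = s^T W h_i, with W standing in for a
# learned parameter matrix.
W = rng.standard_normal((3, 3))
e_mul = H @ W.T @ s

# Either score vector is softmaxed into an attention distribution over
# source positions, then used to take a weighted sum of the values.
alpha = softmax(e_dot)
context = alpha @ H
print(context.shape)  # (3,)
```

The context vector has the same size as one encoder state, and is what the decoder consumes on that step alongside its own hidden state.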