Machine Translation (MT) is the task of translating a sentence $x$ from the source language to a sentence $y$ in the target language.
Core idea: learn a probabilistic model from data.
We want to find the best English sentence $y$, given the French sentence $x$: $\arg\max_y P(y|x)$. Using Bayes' rule, this decomposes into two components that are learned separately: $\arg\max_y P(x|y)\,P(y)$.
Translation model $P(x|y)$: models how words and phrases should be translated. Learned from parallel data.
Note: parallel data means pairs of human-translated French/English sentences.
Language model $P(y)$: models how to write good English. Learned from monolingual data.
Alignment is the correspondence between particular words in the translated sentence pair.
Alignment can be one-to-many, many-to-one, or many-to-many.
We learn $P(x, a|y)$ as a combination of many factors, where $a$ is the (latent) alignment.
Use a heuristic search algorithm to find the best translation, discarding hypotheses that are too low-probability. This process is called decoding.
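As a toy illustration of this noisy-channel decomposition, the sketch below uses hand-assigned log-probabilities as pure placeholders for the learned translation and language models (the sentences and numbers are illustrative, not from real models): the literal gloss scores well under $P(x|y)$ but badly under $P(y)$, so the fluent translation wins.

```python
import math

# Candidates for translating "il a m'entarté"; log-probs are hand-picked toys.
TM = {"he has me entarted": math.log(0.4),    # log P(x|y): literal gloss fits x well
      "he hit me with a pie": math.log(0.3)}
LM = {"he has me entarted": math.log(0.001),  # log P(y): not fluent English
      "he hit me with a pie": math.log(0.2)}

def best_translation(candidates):
    # argmax_y P(x|y) P(y), computed in log space.
    return max(candidates, key=lambda y: TM[y] + LM[y])

print(best_translation(list(TM)))  # -> "he hit me with a pie"
```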
Sequence-to-sequence (Seq2seq) involves two RNNs: an encoder and a decoder. Seq2seq is an example of a conditional language model:
- language model: the decoder predicts the next word of the target sentence $y$
- conditional: its predictions are also conditioned on the source sentence $x$
- Seq2seq is useful for many tasks beyond MT, e.g. code generation (natural language -> Python code)
NMT directly calculates $P(y|x) = P(y_1|x)\,P(y_2|y_1,x)\cdots P(y_T|y_1,\dots,y_{T-1},x)$, i.e. the probability of each next target word given the target words so far and the source sentence $x$.
Question: How to train?
Answer: Get a big parallel corpus… then train the whole system end-to-end, backpropagating the cross-entropy loss on next-word prediction (a sketch follows below).
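To make the training step concrete, here is a minimal sketch assuming PyTorch; the model sizes, special token ids (PAD/SOS), and the toy random batch are illustrative stand-ins, not from the notes.

```python
import torch
import torch.nn as nn

PAD, SOS = 0, 1                          # illustrative special token ids
SRC_VOCAB, TGT_VOCAB, HIDDEN = 100, 100, 64

class Seq2seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, HIDDEN, padding_idx=PAD)
        self.tgt_emb = nn.Embedding(TGT_VOCAB, HIDDEN, padding_idx=PAD)
        self.encoder = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.decoder = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, TGT_VOCAB)

    def forward(self, src, tgt_in):
        # Encode the source; the final hidden state conditions the decoder.
        _, h = self.encoder(self.src_emb(src))
        # Teacher forcing: the decoder sees the gold previous target word.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), h)
        return self.out(dec_out)         # (batch, tgt_len, TGT_VOCAB) logits

model = Seq2seq()
opt = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)

# One toy random batch standing in for the parallel corpus.
src = torch.randint(2, SRC_VOCAB, (8, 10))
tgt = torch.randint(2, TGT_VOCAB, (8, 12))
tgt_in = torch.cat([torch.full((8, 1), SOS, dtype=torch.long),
                    tgt[:, :-1]], dim=1)

logits = model(src, tgt_in)
# Loss = mean negative log-likelihood of each gold next word, i.e.
# -(1/T) * sum_t log P(y_t | y_<t, x); gradients flow end-to-end.
loss = loss_fn(logits.reshape(-1, TGT_VOCAB), tgt.reshape(-1))
loss.backward()
opt.step()
```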
Greedy decoding (taking the most probable word on each step) has many problems: it cannot undo decisions. Exhaustive search decoding (scoring every possible sequence) is far too expensive. Instead we use beam search decoding.
Core idea: on each step of the decoder, keep track of the k most probable partial translations (hypotheses)
- k is the beam size
- A hypothesis's score is its log probability: $\text{score}(y_1,\dots,y_t) = \sum_{i=1}^{t}\log P(y_i \mid y_1,\dots,y_{i-1}, x)$. Scores are all negative, and a higher score is better
- We search for high-scoring hypotheses, tracking the top k on each step
Beam search is not guaranteed to find the optimal solution, but it is far more efficient than exhaustive search.
In greedy decoding, usually we decode until the model produces an <END> token. In beam search decoding, different hypotheses may produce <END> on different timesteps; when a hypothesis produces <END>, it is complete: set it aside and continue exploring other hypotheses.
Problem: how do we select the top hypothesis? Longer hypotheses have lower scores, since each additional term $\log P(\cdot)$ is negative.
Fix: normalize by length: $\frac{1}{t}\sum_{i=1}^{t}\log P(y_i \mid y_1,\dots,y_{i-1}, x)$. (See the sketch below.)
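A minimal sketch of beam search in plain Python; the step function `log_probs`, the beam size, and the token ids are hypothetical stand-ins for a real decoder.

```python
import heapq
import math

EOS = 2  # hypothetical <END> token id

def beam_search(log_probs, k=5, max_len=20):
    # Each hypothesis is (score, tokens); score = sum of log-probabilities.
    beams = [(0.0, [])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            # Expand each hypothesis with every candidate next token.
            for tok, lp in log_probs(seq):
                candidates.append((score + lp, seq + [tok]))
        # Keep only the k most probable partial translations.
        beams = heapq.nlargest(k, candidates, key=lambda c: c[0])
        # A hypothesis that produced <END> is complete: set it aside.
        finished += [b for b in beams if b[1][-1] == EOS]
        beams = [b for b in beams if b[1][-1] != EOS]
        if not beams:
            break
    finished += beams  # unfinished hypotheses still compete
    # Normalize by length so longer hypotheses are not unfairly penalized.
    return max(finished, key=lambda c: c[0] / len(c[1]))

# Toy stand-in for the decoder's next-word log-probabilities.
toy = lambda seq: [(3, math.log(0.6)), (EOS, math.log(0.4))]
print(beam_search(toy, k=2, max_len=5))
```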
Compared to SMT, NMT has many advantages:
- better performance
- more fluent
- better use of context
- better use of phrase similarities
- A single neural network to be optimized end-to-end
- No subcomponents to be individually optimized
- Requires much less human engineering effort
- No feature engineering
But NMT also has disadvantages compared to SMT:
- NMT is less interpretable
- Hard to debug
- NMT is difficult to control
BLEU (Bilingual Evaluation Understudy)
BLEU compares the machine-written translation to one or several human-written translations, and computes a similarity score based on:
- n-gram precision (usually for 1-, 2-, 3- and 4-grams)
- plus a penalty for too-short system translations
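A rough sketch of the BLEU computation (single reference, uniform weights over 1-4 grams); real implementations such as sacrebleu add smoothing and other details.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    # Modified n-gram precision: clip candidate counts by reference counts.
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        log_prec += math.log(max(overlap, 1e-9) / total) / max_n
    # Brevity penalty: punish system translations shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_prec)

print(bleu("the cat sat on the mat".split(),
           "the cat is on the mat".split()))
```

Note how a single missing 4-gram drives this unsmoothed score toward zero, which is why real implementations smooth the higher-order precisions.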
Many difficulties remain in MT:
- Domain mismatch between train and test data
- Maintaining context over longer text
- Low-resource language pairs
- NMT picks up biases in training data
- Uninterpretable systems do strange things
Definition: given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query. The weighted sum is a selective summary of the information contained in the values; the query determines which values to focus on.
Given values $h_1,\dots,h_n \in \mathbb{R}^{d_1}$ and a query $s \in \mathbb{R}^{d_2}$, the attention variants differ in how the scores $e_i$ are computed:
- Basic dot-product attention: $e_i = s^\top h_i$ (assumes $d_1 = d_2$)
- Multiplicative attention: $e_i = s^\top W h_i$, where $W \in \mathbb{R}^{d_2 \times d_1}$ is a learnable weight matrix
- Additive attention: $e_i = v^\top \tanh(W_1 h_i + W_2 s)$, where $W_1$, $W_2$ and $v$ are learnable
In all variants, we take $\alpha = \text{softmax}(e)$ and output the weighted sum $a = \sum_i \alpha_i h_i$.
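A small sketch of the three scoring variants, assuming PyTorch; the shapes and the randomly initialized parameters `W`, `W1`, `W2`, `v` are illustrative only.

```python
import torch

n, d = 6, 8                      # illustrative sizes
h = torch.randn(n, d)            # values: encoder hidden states h_1..h_n
s = torch.randn(d)               # query: current decoder hidden state
W = torch.randn(d, d)            # learnable (here random) parameters
W1, W2, v = torch.randn(d, d), torch.randn(d, d), torch.randn(d)

e_dot = h @ s                                  # e_i = s^T h_i
e_mul = h @ W.T @ s                            # e_i = s^T W h_i
e_add = torch.tanh(h @ W1.T + s @ W2.T) @ v    # e_i = v^T tanh(W1 h_i + W2 s)

# The rest is the same for every variant:
alpha = torch.softmax(e_dot, dim=0)  # attention distribution over the values
a = alpha @ h                        # attention output: weighted sum of values
```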