“Language isn’t a formal system. Language is glorious chaos.” (Chris Manning)
- Modern methods: recurrent networks, attention
- Big Picture
- Build systems in PyTorch: word meaning, dependency parsing, machine translation, question answering.
signifier (symbol) <=> signified (idea or thing) = denotational semantics
Common solution: use e.g. WordNet, a thesaurus containing lists of synonym sets and hypernyms ("is-a" relationships)
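A short sketch of querying WordNet through NLTK (this assumes the wordnet corpus has already been downloaded with nltk.download("wordnet")):

```python
from nltk.corpus import wordnet as wn

# Synonym sets for "good", grouped by part of speech and sense.
for synset in wn.synsets("good")[:5]:
    print(synset.name(), synset.lemma_names())

# Hypernyms ("is-a" ancestors) of "panda": panda -> procyonid -> carnivore -> ...
panda = wn.synset("panda.n.01")
hyper = lambda s: s.hypernyms()
print(list(panda.closure(hyper)))
```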
Problems with a resource like WordNet:
- misses nuance: e.g. "proficient" is listed as a synonym for "good", but that only holds in some contexts
- misses new meanings of words: wicked, badass, ninja; impossible to keep up to date
- requires human labor to create and adapt
- can't compute accurate word similarity
Problem with one-hot vectors: all vectors are orthogonal, so there is no natural notion of similarity. Solution: encode similarity in the vectors themselves (see the sketch below).
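A quick sketch of why one-hot vectors fail here, using a toy four-word vocabulary; the dense vector values below are made up for illustration, not learned embeddings:

```python
import numpy as np

# One-hot vectors for "motel" and "hotel" in a 4-word vocabulary.
motel = np.array([1.0, 0.0, 0.0, 0.0])
hotel = np.array([0.0, 1.0, 0.0, 0.0])

# Any two distinct one-hot vectors are orthogonal: dot product is 0,
# so the representation carries no similarity information at all.
print(motel @ hotel)  # 0.0

# Dense vectors can encode similarity: related words end up with
# high cosine similarity.
motel_d = np.array([0.29, 0.79, -0.18, 0.52])
hotel_d = np.array([0.32, 0.81, -0.21, 0.47])
cos = motel_d @ hotel_d / (np.linalg.norm(motel_d) * np.linalg.norm(hotel_d))
print(round(cos, 3))  # close to 1.0
```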
Distributional semantics: A word’s meaning is given by the words that frequently appear close-by.
Word Vectors: a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts.
Note: word vectors = word embeddings = word representations. They are all distributed representations.
Word2vec idea:
- We have a large corpus of text
- Every word in a fixed vocabulary is represented by a vector
- Go through each position t in the text, which has a center word c and context words o
- Use the similarity of the word vectors for c and o to calculate the probability of o given c
- Keep adjusting the word vectors to maximize this probability (the walk over positions is sketched after this list)
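A minimal sketch of that walk over the corpus; the sentence and window size here are made-up choices for illustration:

```python
# For each position t, take the center word c and the context words o
# within a window of m positions on each side.
corpus = "the quick brown fox jumps over the lazy dog".split()
m = 2  # window size (arbitrary choice for this example)

for t, center in enumerate(corpus):
    lo, hi = max(0, t - m), min(len(corpus), t + m + 1)
    context = [corpus[j] for j in range(lo, hi) if j != t]
    print(f"center={center!r}, context={context}")
```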
Note: theta is all the variables to be optimized. For each position t = 1, ..., T, we predict the context words within a window of fixed size m given the center word w_t. The objective (sometimes called the cost / loss function) is the average negative log-likelihood:

J(theta) = -(1/T) * sum_{t=1..T} sum_{-m <= j <= m, j != 0} log P(w_{t+j} | w_t; theta)

Minimizing the objective function <=> maximizing predictive accuracy
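As a tiny worked example, assuming some hypothetical predicted probabilities for three (center, context) pairs, J(theta) is just the average of their negative logs:

```python
import math

# Hypothetical values of P(o | c) for three (center, context) pairs.
probs = [0.05, 0.01, 0.10]

# J(theta) = average negative log probability over all pairs.
J = -sum(math.log(p) for p in probs) / len(probs)
print(J)  # smaller J <=> higher probability assigned to true context words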
Question: How do we calculate P(w_{t+j} | w_t; theta)?
Answer: We use two vectors per word w:
v_w when w is a center word
u_w when w is a context word
Then for a center word c and a context word o, P(o | c) is a softmax over dot products:

P(o | c) = exp(u_o . v_c) / sum_{w in V} exp(u_w . v_c)
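A minimal numeric sketch of this softmax; the vocabulary size, dimension, and random vectors below are made-up stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                 # vocabulary size, embedding dimension (arbitrary)
U = rng.normal(size=(V, d))  # context ("outside") vectors u_w, one row per word
v_c = rng.normal(size=d)     # center word's vector v_c

scores = U @ v_c                               # u_w . v_c for every word w
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary
o = 3                                          # index of some context word o
print(probs[o], probs.sum())                   # P(o | c); probabilities sum to 1
```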
Recall: theta represents all model parameters, in one long vector
In our case, with d-dimensional vectors and V-many words, theta stacks the vectors v_w and u_w of every word, so it has 2dV entries
Remember: every word has two vectors
We optimize these parameters by walking down the gradient
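A toy PyTorch sketch of that gradient walk, assuming made-up sizes and (center, context) index pairs; a real implementation would train on a corpus and avoid the full softmax for efficiency (e.g. with negative sampling):

```python
import torch

V, d, lr = 10, 4, 0.05  # vocabulary size, dimension, learning rate (arbitrary)
center_vecs = torch.randn(V, d, requires_grad=True)   # v_w for every word
context_vecs = torch.randn(V, d, requires_grad=True)  # u_w for every word
pairs = [(0, 1), (0, 2), (3, 4)]  # made-up (center, context) index pairs

for step in range(100):
    loss = torch.zeros(())
    for c, o in pairs:
        scores = context_vecs @ center_vecs[c]        # u_w . v_c for all w
        log_probs = torch.log_softmax(scores, dim=0)  # log P(. | c)
        loss = loss - log_probs[o]                    # accumulate -log P(o | c)
    loss = loss / len(pairs)                          # average over pairs
    loss.backward()
    with torch.no_grad():
        # One step down the gradient on theta = (center_vecs, context_vecs).
        center_vecs -= lr * center_vecs.grad
        context_vecs -= lr * context_vecs.grad
        center_vecs.grad.zero_()
        context_vecs.grad.zero_()
```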