# Course 1 - Introduction and Word Vectors

> "Language isn’t a formal system. Language is glorious chaos." (Chris Manning)

Goals

• Modern methods: recurrent networks, attention
• The big picture of NLP
• Building systems in PyTorch: word meaning, dependency parsing, machine translation, question answering

## 1. How do we represent the meaning of a word?

Denotational semantics: signifier (symbol) <=> signified (idea or thing)

## 2. How do we have usable meaning in a computer?

Common solution: use a lexical resource such as WordNet, which contains lists of synonym sets and hypernyms ("is a" relationships).
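As a quick illustration, WordNet is queryable from Python. A minimal sketch (assumes nltk is installed and the wordnet corpus has been fetched once via nltk.download("wordnet")):

```python
from nltk.corpus import wordnet as wn

# Synonym sets ("synsets") for the word "good"
for synset in wn.synsets("good")[:3]:
    print(synset.name(), "->", [lemma.name() for lemma in synset.lemmas()])

# Hypernyms ("is a" relationships) for the first sense of "panda"
panda = wn.synsets("panda")[0]
print(panda.hypernyms())
```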

Problems

• Misses nuance: e.g., "proficient" is listed as a synonym for "good", but only in some contexts
• Misses new meanings of words: wicked, badass, ninja; impossible to keep up to date
• Subjective
• Requires human labor to create and adapt
• Can’t compute accurate word similarity

### Representing words as discrete symbols

In traditional NLP, words are treated as discrete symbols and represented as one-hot vectors. Problem: any two distinct one-hot vectors are orthogonal, so there is no natural notion of similarity. Instead of relying on external resources, we can learn to encode similarity in the vectors themselves.
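A tiny NumPy sketch of the problem (the three-word vocabulary is illustrative):

```python
import numpy as np

vocab = ["motel", "hotel", "cat"]  # toy vocabulary; each word gets one slot

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Any two distinct one-hot vectors are orthogonal:
print(one_hot("motel") @ one_hot("hotel"))  # 0.0 -- no notion of similarity
```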

Distributional semantics: A word’s meaning is given by the words that frequently appear close-by.

Word vectors: a dense vector for each word, chosen so that each word's vector is similar to the vectors of words that appear in similar contexts.

Note: Word Vectors = Word Embeddings = Word Representations. They are all distributed representations.
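For example, libraries like gensim expose trained embeddings behind exactly this interface. A minimal sketch (assumes gensim ≥ 4 is installed; the three-sentence corpus is a toy placeholder, so the learned numbers are not meaningful):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (placeholder data).
sentences = [
    ["i", "stayed", "at", "a", "hotel"],
    ["i", "stayed", "at", "a", "motel"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

# Train tiny word vectors; real use needs a large corpus.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["hotel"][:5])                  # a dense vector for "hotel"
print(model.wv.similarity("hotel", "motel"))  # cosine similarity of two words
```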

## 3. Word2Vec: Overview

Idea:

• Start with a large corpus of text
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word c and context words o
• Use the similarity of the word vectors for c and o to calculate the probability of o given c
• Keep adjusting the word vectors to maximize this probability

Note: θ denotes all the variables to be optimized.

### Objective function J

sometimes called the cost or loss function
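Concretely (the standard skip-gram setup; m is the window size and T the corpus length): for each position t = 1, …, T, predict the context words within a window of size m given the center word w_t. The likelihood and the objective (average negative log-likelihood) are:

$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)$$

$$J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)$$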

Minimizing objective function <=> Maximizing predictive accuracy

Question: How to calculate P?

Answer: We use two vectors per word w

• v_w when w is a center word
• u_w when w is a context word

Then for a center word c and a context word o:
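$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$

This is a softmax: the dot product u_o·v_c measures the similarity of o and c, exponentiation makes everything positive, and normalizing over the entire vocabulary V gives a probability distribution. A minimal NumPy sketch of this computation (the matrix layout and names are my own convention, not from the lecture):

```python
import numpy as np

def p_context_given_center(U, V, c, o):
    """Softmax P(o | c) for the skip-gram model.

    U: (vocab_size, d) matrix whose rows are context vectors u_w
    V: (vocab_size, d) matrix whose rows are center vectors v_w
    c, o: integer indices of the center and context words
    """
    scores = U @ V[c]                  # u_w . v_c for every word w
    scores -= scores.max()             # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]
```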

### To train the model: Compute all vector gradients

• Recall: θ represents all model parameters in one long vector

• In our case, with d-dimensional vectors and a vocabulary of V words, θ ∈ ℝ^{2dV}

• Remember: every word has two vectors (v_w and u_w), which is why the dimension is 2dV

• We optimize these parameters by walking down the gradient
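Differentiating log P(o | c) from the softmax above with respect to the center vector v_c gives the classic "observed minus expected" form:

$$\frac{\partial}{\partial v_c} \log P(o \mid c) = u_o - \sum_{x \in V} P(x \mid c)\, u_x$$

A minimal NumPy sketch of one stochastic gradient descent step on the loss −log P(o | c) (sizes, initialization, and learning rate are illustrative; real word2vec implementations replace the full softmax with negative sampling):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d, lr = 1000, 50, 0.05
V = rng.normal(scale=0.1, size=(vocab_size, d))   # center vectors v_w
U = rng.normal(scale=0.1, size=(vocab_size, d))   # context vectors u_w

def sgd_step(c, o):
    """One SGD step on -log P(o | c) for a single (center, context) pair."""
    scores = U @ V[c]
    scores -= scores.max()                          # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax: P(. | c)

    grad_vc = U.T @ probs - U[o]    # expected context vector minus observed u_o
    grad_U = np.outer(probs, V[c])  # each u_w gets P(w | c) * v_c ...
    grad_U[o] -= V[c]               # ... and the true context word gets -v_c too

    V[c] -= lr * grad_vc            # walk down the gradient (in-place updates)
    U[:] -= lr * grad_U

sgd_step(c=3, o=17)  # e.g. word 3 as center, word 17 as one of its context words
```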