# Course 2 - Word Vectors and Word Senses

First, some word analogies, solved by exploiting the linear structure of the word-vector space.

Function definition:

```python
def analogy(x1, x2, y1):
```
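A minimal sketch of the body, assuming `model` is a pretrained gensim `KeyedVectors` instance (`most_similar` is gensim's built-in):

```python
def analogy(x1, x2, y1):
    # Solve x1 : x2 :: y1 : ?  via the vector offset  y1 + (x2 - x1).
    # most_similar returns (word, similarity) pairs, best first.
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]
```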

Example usage:

```python
analogy('japan', 'japanese', 'australia')  # australian
```

Next: a **PCA scatterplot** of word vectors.

```python
def display_pca_scatterplot(model, words=None, sample=0):
```
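A sketch of the full function, assuming a gensim 4.x `KeyedVectors` model (`key_to_index` is the gensim 4.x vocabulary attribute):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def display_pca_scatterplot(model, words=None, sample=0):
    # With no word list, plot a random sample (or the whole vocabulary).
    if words is None:
        vocab = list(model.key_to_index)
        words = np.random.choice(vocab, sample) if sample > 0 else vocab

    word_vectors = np.array([model[w] for w in words])

    # Project down to the first two principal components.
    twodim = PCA().fit_transform(word_vectors)[:, :2]

    plt.figure(figsize=(6, 6))
    plt.scatter(twodim[:, 0], twodim[:, 1], edgecolors='k', c='r')
    for word, (x, y) in zip(words, twodim):
        plt.text(x + 0.05, y + 0.05, word)
    plt.show()
```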

## Word Vectors and word2vec

### 1. Review: Main idea of word2vec

- Iterate through each word of the whole corpus
- Predict surrounding words using word vectors
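For reference, the predicted probability of an outside word $o$ given a center word $c$ is the softmax over dot products of the outside vector $u_o$ with the center vector $v_c$:

$$P(o \mid c) = \frac{\exp\left(u_o^\top v_c\right)}{\sum_{w \in V} \exp\left(u_w^\top v_c\right)}$$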

## Optimization Basics: Gradient Descent (CS229 Optimization)

Idea: for the current value of $\theta$, calculate the gradient of $J(\theta)$, then take a **small step in the direction of the negative gradient**. Repeat.

**Update equation** (in matrix notation), with step size (learning rate) $\alpha$:

$$\theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta)$$

**Update equation** (for a single parameter):

$$\theta_j^{new} = \theta_j^{old} - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

Algorithm:

```python
while True:
    theta_grad = evaluate_gradient(J, corpus, theta)
    theta = theta - alpha * theta_grad
```
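For a concrete feel, a runnable toy version that minimizes a simple quadratic; `evaluate_gradient` here is a hand-written stand-in, not a course function:

```python
import numpy as np

# Toy objective J(theta) = ||theta||^2, whose gradient is 2 * theta.
def evaluate_gradient(theta):
    return 2 * theta

theta = np.array([3.0, -2.0])
alpha = 0.1  # step size (learning rate)

for _ in range(100):               # fixed budget instead of `while True`
    theta = theta - alpha * evaluate_gradient(theta)

print(theta)  # close to the minimum at [0, 0]
```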

### Stochastic Gradient Descent (SGD)

Problem: $J(\theta)$ is a function of all windows in the corpus, so computing its full gradient is very expensive.

Solution: Repeatedly sample windows, and update after each one

Algorithm:

```python
while True:
    window = sample_window(corpus)
    theta_grad = evaluate_gradient(J, window, theta)
    theta = theta - alpha * theta_grad
```

### Stochastic gradients with word vectors

Iteratively take gradients at each such window for SGD

But in each window, we only have at most *2m + 1* words, so the gradient $\nabla_\theta J_t(\theta)$ is very sparse!

Solution:

- We need sparse matrix update operations to only update certain rows of the full embedding matrices *U* and *V*
- We need to keep around a hash for word vectors
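A minimal numpy sketch of such a sparse row update; the shapes, indices, and gradients below are illustrative, not from the course:

```python
import numpy as np

# U holds outside-word vectors, V center-word vectors (one row per word).
vocab_size, d = 10_000, 100
U = 0.01 * np.random.randn(vocab_size, d)
V = 0.01 * np.random.randn(vocab_size, d)
alpha = 0.05

# A window touches at most 2m + 1 words, so only those rows get gradients.
window_ids = [17, 42, 108]                    # indices of words in this window
grad_U = np.random.randn(len(window_ids), d)  # stand-ins for real gradients
grad_V = np.random.randn(len(window_ids), d)

# Sparse update: modify only the rows that appeared in the window.
U[window_ids] -= alpha * grad_U
V[window_ids] -= alpha * grad_V
```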

### Word2Vec: More details

Two model variants:

- Skip-grams (SG): predict context words given the center word
- Continuous Bag of Words (CBOW): predict the center word from context words

=> **Skip-gram model**

Additional efficiency in training:

- Negative sampling

Paper: “Distributed Representations of Words and Phrases and their Compositionality” (Mikolov et al. 2013)

**Overall objective function** (maximize):

$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J_t(\theta), \qquad J_t(\theta) = \log \sigma\left(u_o^\top v_c\right) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)}\left[\log \sigma\left(-u_j^\top v_c\right)\right]$$

The sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

So in the first log term we maximize the probability of the two words co-occurring.

In other words: maximize the probability that the real outside word appears, and minimize the probability that random words appear around the center word.

Negative words are sampled using the unigram distribution $U(w)$ raised to the 3/4 power,

$$P(w) = U(w)^{3/4} / Z,$$

which makes less frequent words be sampled more often.
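A small sketch of this sampling trick; the word list and counts are made up for illustration:

```python
import numpy as np

# Made-up unigram counts: "the" is far more frequent than "aardvark".
words  = ["the", "cat", "sat", "aardvark"]
counts = np.array([1000.0, 50.0, 30.0, 2.0])

unigram  = counts / counts.sum()    # U(w)
weighted = unigram ** 0.75          # U(w)^(3/4)
p = weighted / weighted.sum()       # normalize by Z

# Rare words gain probability mass relative to their raw frequency.
print(dict(zip(words, p.round(4))))

# Draw k negative samples per (center, outside) pair.
negatives = np.random.choice(words, size=5, p=p)
```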

## Can we capture this essence more effectively by counting?

With a co-occurrence matrix *X*

- 2 options: windows vs. full documents
- Window: similar to word2vec, use a window around each word -> captures both syntactic (POS) and semantic information
- Word-document co-occurrence matrix gives general topics, leading to *Latent Semantic Analysis*
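A minimal sketch of building a window-based co-occurrence matrix, using the lecture's example corpus ("I like deep learning. I like NLP. I enjoy flying."):

```python
import numpy as np

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
m = 1  # symmetric window size

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# X[i, j] counts how often word j occurs within m words of word i.
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    for pos, w in enumerate(sent):
        window = sent[max(0, pos - m):pos] + sent[pos + 1:pos + 1 + m]
        for ctx in window:
            X[idx[w], idx[ctx]] += 1
```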

Models built on these vectors are less robust: the vectors increase in size with the vocabulary, are very high dimensional, and are sparse.

Solution: Low dimensional vectors

Idea: store most information in a fixed, small number of dimensions

### Singular Value Decomposition of the co-occurrence matrix *X* (SVD)
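The idea: factor $X = U S V^\top$ and keep only the top $k$ singular values to get low-dimensional word vectors. A minimal numpy sketch, with a stand-in matrix in place of a real co-occurrence matrix like the one built above:

```python
import numpy as np

X = np.random.rand(8, 8)  # stand-in for a co-occurrence matrix

# Factor X = U @ np.diag(S) @ Vt and keep the top-k singular directions.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
word_vectors = U[:, :k] * S[:k]   # k-dimensional representation per word
```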

### Hacks to X (several used in Rohde et al. 2005)

Problem: function words (*the, he, has*) are too frequent -> syntax has too much impact

Solutions:

- Cap the counts: min(X, t) with t ≈ 100, or ignore the most frequent words entirely
- Use Pearson correlations instead of counts, then set negative values to 0

## The GloVe model of word vectors

### Crucial insight

Ratios of co-occurrence probabilities can encode meaning components
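This insight leads to a log-bilinear model; for reference, the GloVe objective (Pennington et al. 2014), where $X_{ij}$ is the co-occurrence count and $f$ caps the weight of very frequent pairs:

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$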

**Merits**

- Fast training
- Scalable to huge corpora
- Good performance even with small corpus and small vectors

## Evaluating word vectors

### Intrinsic

- Evaluation on a specific/intermediate subtask
- Fast to compute
- Helps to understand that system
- Not clear if really helpful unless correlation to a real task is established

Methods:

- Word vector analogies, scored by cosine distance of word vectors
- Word vector distances and their correlation with human judgments
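A minimal sketch of the second method: compare model cosine similarities against human ratings with a rank correlation. The vectors and ratings below are made up (WordSim-353-style pairs):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up word vectors and human similarity ratings.
vecs = {w: np.random.randn(50) for w in ["tiger", "cat", "car", "plane"]}
pairs = [("tiger", "cat", 7.35), ("car", "plane", 5.77), ("cat", "plane", 1.2)]

model_sims = [cosine(vecs[a], vecs[b]) for a, b, _ in pairs]
human_sims = [s for _, _, s in pairs]

# Higher rank correlation = model distances better match human judgments.
rho, _ = spearmanr(model_sims, human_sims)
print(rho)
```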

### Extrinsic

- Evaluation on a real task
- Can take a long time to compute accuracy
- Unclear if the subsystem is the problem or its interaction with other subsystems
- If replacing exactly one subsystem with another improves accuracy, we can credit that subsystem

Example: NER

## Word senses

### Word senses and word sense ambiguity

Most words have lots of meanings, especially common words.