Stanford CS224n Natural Language Processing Course

Lecture 2 - Word Vectors and Word Senses

First, some word analogies, computed from linear similarities in the word vector space.

Vector Composition

Function definition:

def analogy(x1, x2, y1):
    # Solve x1 : x2 :: y1 : ? by vector arithmetic (x2 - x1 + y1)
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]

Some example calls:

analogy('japan','japanese','australia') # australian
analogy('man','king','woman') # queen
analogy('australia','beer','france') # champagne
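
These snippets assume `model` is a set of pre-trained word vectors loaded with gensim; the notes do not show the loading step, so here is a minimal sketch of one possibility (the vector set name is only an example, not necessarily the one used above):

# Minimal sketch, assuming gensim's downloader API is available
import gensim.downloader as api

model = api.load('glove-wiki-gigaword-100')  # KeyedVectors object with .most_similar()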

Next: a PCA scatterplot of word vectors.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def display_pca_scatterplot(model, words=None, sample=0):
    # If no word list is given, take a random sample (or the whole vocabulary)
    if words is None:
        if sample > 0:
            words = np.random.choice(list(model.vocab.keys()), sample)
        else:
            words = [word for word in model.vocab]

    word_vectors = np.array([model[w] for w in words])

    # Project the vectors onto their first two principal components
    twodim = PCA().fit_transform(word_vectors)[:, :2]

    plt.figure(figsize=(6, 6))
    plt.scatter(twodim[:, 0], twodim[:, 1], edgecolors='k', c='r')
    for word, (x, y) in zip(words, twodim):
        plt.text(x + 0.05, y + 0.05, word)
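
A possible call, assuming the function and the `model` defined above (the word list is just an illustration):

display_pca_scatterplot(model, words=['king', 'queen', 'man', 'woman', 'paris', 'france'])
plt.show()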

Word Vectors and word2vec

1. Review: Main idea of word2vec

  • Iterate through each word of the whole corpus
  • Predict surrounding words using word vectors

Optimization Basics: Gradient Descent (cf. CS229 optimization)

Idea: from the current value of theta, calculate the gradient of J(theta), then take a small step in the direction of the negative gradient; repeat.

Update equation (in matrix notation): theta_new = theta_old - alpha * grad_theta J(theta), where alpha is the step size (learning rate)

Update equation (for a single parameter): theta_j_new = theta_j_old - alpha * (d/d theta_j) J(theta)

Algorithm:

# Batch gradient descent: use the full corpus for every update
while True:
    theta_grad = evaluate_gradient(J, corpus, theta)
    theta = theta - alpha * theta_grad
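
As a toy illustration of the update rule (not from the lecture), a minimal sketch of gradient descent on the one-dimensional quadratic J(theta) = (theta - 3)^2:

# Minimal sketch: the gradient of (theta - 3)^2 is 2 * (theta - 3)
theta = 0.0
alpha = 0.1
for _ in range(100):
    theta_grad = 2 * (theta - 3)
    theta = theta - alpha * theta_grad
print(theta)  # converges towards 3.0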

Stochastic Gradient Descent (SGD)

Problem: J(theta) is a function of all windows in the corpus, so computing its full gradient is very expensive. Solution: repeatedly sample windows, and update after each one.

Algorithm:

# Stochastic gradient descent: update from a single sampled window at a time
while True:
    window = sample_window(corpus)
    theta_grad = evaluate_gradient(J, window, theta)
    theta = theta - alpha * theta_grad

Stochastic Gradient with word vectors:

Iteratively take gradients at each such window for SGD

But each window contains at most 2m+1 words (for window size m), so the gradient is very sparse!

Solution:

  • We need sparse matrix update operations that only update certain rows of the full embedding matrices U and V (a small sketch follows this list)
  • We need to keep around a hash for the word vectors
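
A minimal sketch of the row-wise update idea, assuming numpy arrays U and V of shape (vocab_size, d); the indices and gradients below are made up for illustration, not part of the lecture:

import numpy as np

vocab_size, d = 10000, 50
U = np.random.randn(vocab_size, d) * 0.01   # "outside" word vectors
V = np.random.randn(vocab_size, d) * 0.01   # "center" word vectors
alpha = 0.025

# Suppose these indices are the words that occurred in the current window,
# with placeholder gradients; all other rows stay untouched.
window_ids = np.array([17, 42, 256])
grad_U_rows = np.random.randn(len(window_ids), d)
grad_V_rows = np.random.randn(len(window_ids), d)

U[window_ids] -= alpha * grad_U_rows   # update only these rows
V[window_ids] -= alpha * grad_V_rows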

Word2Vec: More details

Two model variants:

  • Skip-grams (SG): Predict context words given the center word
  • Continuous Bag of Words (CBOW): Predict the center word from the context words

=> This course focuses on the Skip-gram model.

Additional efficiency in training:

  • Negative sampling

Paper: “Distributed Representations of Words and Phrases and their Compositionality” (Mikolov et al., 2013)

Overall objective function (maximize): J_t(theta) = log sigma(u_o^T v_c) + sum over the k sampled negative words j ~ P(w) of log sigma(-u_j^T v_c), averaged over all windows t

The sigmoid function: sigma(x) = 1 / (1 + e^(-x)), which maps any real number into (0, 1)

So in the first log term we maximize the probability that the two words (center word and true outside word) co-occur.

Maximize the probability that the real outside word appears; minimize the probability that random (noise) words appear around the center word.

Negative words are sampled from the unigram distribution U(w) raised to the 3/4 power, P(w) = U(w)^(3/4) / Z, which makes less frequent words be sampled more often.
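
A minimal sketch of this sampling distribution (the word counts below are made up for illustration):

import numpy as np

# Made-up unigram counts; real counts would come from the corpus
counts = {'the': 5000, 'cat': 50, 'aardvark': 2}
words = list(counts)

unigram = np.array([counts[w] for w in words], dtype=float)
unigram /= unigram.sum()                # U(w)
p = unigram ** 0.75
p /= p.sum()                            # P(w) = U(w)^(3/4) / Z

negative_samples = np.random.choice(words, size=5, p=p)
print(dict(zip(words, p)))              # rare words are boosted relative to U(w)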

Can we capture this essence more effectively by counting?

With a co-occurrence matrix X

  • 2 options: windows vs. full document
  • Window: similar to word2vec, use a window around each word -> captures both syntactic (POS) and semantic information
  • Word-document co-occurrence matrix: gives general topics, leading to Latent Semantic Analysis

Such count vectors grow with the vocabulary size, so models built directly on them are less robust.

Solution: Low dimensional vectors

Idea: store most information in a fixed, small number of dimensions

Singular Value Decomposition (SVD) of the co-occurrence matrix X
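
A minimal sketch of both steps (window-based counting, then SVD reduction) on a tiny toy corpus; the corpus and the number of dimensions are chosen only for illustration:

import numpy as np

# Toy corpus and window-based co-occurrence counts (window size m = 1)
corpus = [['i', 'like', 'deep', 'learning'],
          ['i', 'like', 'nlp'],
          ['i', 'enjoy', 'flying']]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

# Keep only the top-k singular directions as low-dimensional word vectors
U, s, Vt = np.linalg.svd(X)
k = 2
word_vectors = U[:, :k] * s[:k]
print(dict(zip(vocab, word_vectors.round(2))))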

Hacks to X (several used in Rohde et al. 2005)

Problem: function words (the, he, has) are too frequent -> syntax has too much impact

Solutions:

  • Cap the counts: min(X, t) with t ≈ 100, or ignore these frequent words entirely
  • Use Pearson correlations instead of counts, then set negative values to 0

The GloVe model of word vectors

Crucial insight

Ratios of co-occurrence probabilities can encode meaning components

Merits

  • Fast training
  • Scalable to huge corpora
  • Good performance even with small corpus and small vectors

Evaluating word vectors

Intrinsic

  • Evaluation on a specific/intermediate subtask
  • Fast to compute
  • Helps to understand that system
  • Not clear if really helpful unless correlation with a real task is established

Methods:

  1. Cosine distance of word vectors
  2. Word vector distances and their correlation with human judgments
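
A minimal sketch of the cosine similarity used in these intrinsic evaluations (the vectors below are placeholders, not real embeddings):

import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v_king, v_queen = np.random.randn(50), np.random.randn(50)  # placeholder vectors
print(cosine_similarity(v_king, v_queen))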

Extrinsic

  • Evaluation on a real task
  • Can take a long time to compute accuracy
  • Unclear if subsystem is the problem or its interaction
  • If replacing exactly one subsystem with another improves accuracy, the new subsystem is likely responsible for the improvement

Example: Named Entity Recognition (NER)

Word senses

Word senses and word sense ambiguity

Most words have lots of meanings, especially common words.