Stanford CS224n Natural Language Processing Course

Course Link (including Slides and Materials)

Online Video Link

Online Video Link in Chinese

Course 1 - Introduction and Word Vectors

“Language isn’t a formal system. Language is glorious chaos.” (Chris Manning)

Target

  • Modern methods: recurrent networks, attention
  • Big Picture
  • Build systems in PyTorch: word meaning, dependency parsing, machine translation, question answering.

1. How do we represent the meaning of a word?

signifier (symbol) <=> signified (idea or thing) = denotational semantics

2. How do we have usable meaning in a computer?

Common solution: use e.g. WordNet, a resource containing lists of synonym sets and hypernyms ("is a" relationships).
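
As a quick illustration, browsing WordNet with NLTK looks roughly like this (a sketch, assuming `nltk` is installed and the WordNet corpus has been downloaded; the example words are arbitrary):

```python
# Sketch: browsing WordNet synsets and hypernyms with NLTK.
# Assumes `pip install nltk` and `nltk.download('wordnet')` have been run.
from nltk.corpus import wordnet as wn

# Synonym sets ("synsets") for the word "good"
for synset in wn.synsets("good"):
    print(synset.name(), synset.lemma_names())

# Hypernyms ("is a" relationships) for one sense of "panda"
panda = wn.synset("giant_panda.n.01")
print(panda.hypernyms())
```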

Problems

  • Misses nuance: e.g. "proficient" is listed as a synonym for "good", which only holds in some contexts
  • Misses new meanings of words: e.g. wicked, badass, ninja
  • Subjective
  • Requires human labor to create and adapt
  • Can’t be used to compute accurate word similarity

Representing words as discrete symbols

In traditional NLP, words are represented as one-hot vectors. Any two distinct one-hot vectors are orthogonal, so they carry no notion of similarity; instead, we want to encode similarity in the vectors themselves.
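
A minimal sketch of the problem (the toy vocabulary is made up for illustration): the dot product of two different one-hot vectors is always zero, even for clearly related words.

```python
import numpy as np

# Toy vocabulary; real vocabularies have hundreds of thousands of words.
vocab = ["motel", "hotel", "cat"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# "motel" and "hotel" are clearly related, but their one-hot vectors
# are orthogonal: the dot product is 0 for every pair of distinct words.
print(one_hot("motel") @ one_hot("hotel"))  # 0.0
```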

Distributional semantics: A word’s meaning is given by the words that frequently appear close-by.

Word Vectors: a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts.

Note: word vectors = word embeddings = word representations. They are all distributed representations.

3. Word2Vec: Overview

Idea:

  • We have a large corpus of text
  • Every word in a fixed vocabulary is represented by a vector
  • Go through each position t in the text, which has a center word c and context words o
  • Use the similarity of the word vectors for c and o to calculate the probability of o given c
  • Keep adjusting the word vectors to maximize this probability (see the sketch after this list)
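
A minimal sketch of that loop (the toy corpus and window size are made-up examples; the probability itself is defined below):

```python
# Sketch of the skip-gram setup: for each position t, collect the center
# word c and the context words o within a fixed window around it.
corpus = "the quick brown fox jumps over the lazy dog".split()
window = 2  # context words within +/- 2 positions of the center

for t, center in enumerate(corpus):
    context = [corpus[j]
               for j in range(max(0, t - window), min(len(corpus), t + window + 1))
               if j != t]
    # Word2Vec adjusts the word vectors so that P(o | c) is high
    # for every (center, context) pair collected here.
    print(center, "->", context)
```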

Note: theta denotes all the variables to be optimized (here, the word vectors).

Objective function J

sometimes called cost / loss function
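
The slide formulas are not reproduced in these notes; as a reference, the standard skip-gram likelihood and average negative log-likelihood objective from the lecture are:

```latex
% Likelihood: predict context words within a window of size m
% around each center word w_t, for positions t = 1..T.
L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t ; \theta)

% Objective (cost / loss) function: average negative log-likelihood.
J(\theta) = -\frac{1}{T} \log L(\theta)
          = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t ; \theta)
```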

Minimizing the objective function <=> maximizing predictive accuracy

Question: How to calculate P?

Answer: We use two vectors per word w:

  • v_w when w is a center word
  • u_w when w is a context word

Then for a center word c and a context word o:
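
The formula image is not included here; the standard softmax form from the lecture is:

```latex
% Probability of context word o given center word c: a softmax over
% the dot products of the context vectors u_w with the center vector v_c.
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
```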

Prediction function
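
A minimal NumPy sketch of that prediction function, using random toy vectors (the names `U`, `v_c` and the sizes are illustrative assumptions, not part of the course code):

```python
import numpy as np

# Toy setup: V words with d-dimensional vectors (random for illustration).
V, d = 5, 3
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))   # u_w: one context vector per row
v_c = rng.normal(size=d)      # v_c: vector of the center word c

# Softmax over the dot products u_w . v_c gives P(o | c) for every o.
scores = U @ v_c
probs = np.exp(scores) / np.exp(scores).sum()
print(probs, probs.sum())     # a distribution over the vocabulary, sums to 1
```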

To train the model: Compute all vector gradients

  • Recall: theta represents all model parameters, in one long vector

  • In our case, with d-dimensional vectors and V words in the vocabulary, theta has 2dV entries

  • Remember: every word has two vectors

  • We optimize these parameters by walking down the gradient (see the sketch below)
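
A minimal sketch of one such gradient step, for the naive softmax loss on a single (center, context) pair; the setup (vocabulary size, learning rate, word indices) is made up, and only the center-vector update is shown since the context-vector gradients are analogous:

```python
import numpy as np

# Toy parameters: V words, d-dimensional center (Vc) and context (U) vectors.
V, d = 5, 3
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))    # context vectors u_w
Vc = rng.normal(size=(V, d))   # center vectors v_w
lr = 0.05                      # learning rate (made-up value)

c, o = 2, 4                    # indices of the center and observed context word

# Forward pass: P(w | c) for every word w in the vocabulary.
scores = U @ Vc[c]
probs = np.exp(scores) / np.exp(scores).sum()

# Gradient of -log P(o | c) with respect to the center vector v_c:
#   sum_w P(w | c) * u_w  -  u_o
grad_vc = probs @ U - U[o]

# Walk down the gradient.
Vc[c] -= lr * grad_vc
```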