First, some analogies: word analogies can be solved by exploiting the linear structure of the word-vector space (vector arithmetic plus similarity search).
Code definition (completed as a minimal sketch; it assumes a loaded gensim KeyedVectors `model`, which is not shown in the notes):

def analogy(x1, x2, y1):
    # solve x1 : x2 :: y1 : ?  by searching near the point  y1 + (x2 - x1)
    return model.most_similar(positive=[y1, x2], negative=[x1])[0][0]

analogy('japan', 'japanese', 'australia')  # australian
Next: a PCA scatterplot, visualizing word vectors by projecting them down to 2D.
def display_pca_scatterplot(model, words=None, sample=0):
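A possible full sketch of this function (assuming numpy, matplotlib, scikit-learn, and a gensim KeyedVectors `model`; these library choices are assumptions, not spelled out in the notes):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def display_pca_scatterplot(model, words=None, sample=0):
    # default to a random sample of the vocabulary (or all of it) if no words given
    if words is None:
        vocab = list(model.key_to_index)
        words = np.random.choice(vocab, sample) if sample > 0 else vocab
    word_vectors = np.array([model[w] for w in words])
    # project the high-dimensional vectors onto their first two principal components
    twodim = PCA(n_components=2).fit_transform(word_vectors)
    plt.figure(figsize=(6, 6))
    plt.scatter(twodim[:, 0], twodim[:, 1], c='r', edgecolors='k')
    for word, (x, y) in zip(words, twodim):
        plt.text(x + 0.03, y + 0.03, word)  # label each point with its word
    plt.show()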
Word2vec training recap:
- Iterate through each word position in the whole corpus
- Predict the surrounding (context) words using the word vectors
Idea: for the current value of θ, calculate the gradient of J(θ), then take a small step in the direction of the negative gradient. Repeat.
Update equation (in matrix notation): θ_new = θ_old - α ∇_θ J(θ), where α is the step size (learning rate)
Update equation (for a single parameter): θ_j_new = θ_j_old - α ∂J(θ)/∂θ_j
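A minimal sketch of this update loop (the gradient function `grad_J`, step size, and step count below are illustrative assumptions):

import numpy as np

def gradient_descent(grad_J, theta0, alpha=0.01, n_steps=1000):
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        theta -= alpha * grad_J(theta)  # small step against the gradient
    return theta

# example: minimize J(theta) = ||theta||^2, whose gradient is 2*theta
print(gradient_descent(lambda th: 2 * th, [3.0, -4.0]))  # -> close to [0, 0]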
Problem: J(θ) is a function of all windows in the corpus, so the full gradient ∇_θ J(θ) is very expensive to compute.
Solution: stochastic gradient descent (SGD). Repeatedly sample windows, and update after each one.
Iteratively take gradients at each such window for SGD.
But in each window, we only have at most 2m+1 words, so the gradient is very sparse!
- Either we need sparse matrix update operations that only update certain rows of the full embedding matrices U and V (a toy sketch follows this list),
- or we need to keep around a hash for word vectors
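A toy sketch of such a sparse row update (matrix shapes, gradient arguments, and the learning rate are illustrative assumptions):

import numpy as np

def sgd_window_update(U, V, center, outside_idxs, grad_center, grads_outside, lr=0.025):
    # touch only the rows that occur in this window, never the full matrices
    V[center] -= lr * grad_center            # center-word row of V
    for idx, g in zip(outside_idxs, grads_outside):
        U[idx] -= lr * g                     # outside-word rows of U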
Two model variants:
- Skip-gram (SG): Predict context words given the center word
- Continuous Bag of Words (CBOW): Predict the center word from the context words
=> The model presented so far is the skip-gram model.
Additional efficiency in training:
- Negative sampling
Paper: “Distributed Representations of Words and Phrases and their Compositionality” (Mikolov et al. 2013)
Overall objective function (maximize):
J_t(θ) = log σ(u_o · v_c) + Σ_{i=1..k} E_{j~P(w)} [ log σ(-u_j · v_c) ]
The sigmoid function: σ(x) = 1 / (1 + e^(-x)), which squashes any real number into (0, 1).
So the first log term maximizes the probability of the two words (center word and true outside word) co-occurring.
Overall: maximize the probability that the real outside word appears, and minimize the probability that random words appear around the center word.
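A toy numpy sketch of this objective for one (center, outside) pair (vector names and the matrix of sampled negatives are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(v_c, u_o, U_neg):
    true_term = np.log(sigmoid(u_o @ v_c))              # real pair co-occurs
    noise_term = np.sum(np.log(sigmoid(-U_neg @ v_c)))  # k sampled pairs do not
    return true_term + noise_term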
Negative words are sampled using the unigram distribution U(w) raised to the 3/4 power, P(w) ∝ U(w)^(3/4), which makes less frequent words be sampled more often.
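A quick sketch of that sampling distribution (the raw counts are made-up toy numbers):

import numpy as np

def negative_sampling_probs(counts):
    probs = np.asarray(counts, dtype=float) ** 0.75  # U(w)^(3/4)
    return probs / probs.sum()

# toy counts for a frequent, a mid, and a rare word: the rare word's share grows
print(negative_sampling_probs([900, 90, 10]))  # ~[0.82, 0.15, 0.03] vs raw [0.9, 0.09, 0.01]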
With a co-occurrence matrix X:
- 2 options: windows vs. full document
- Window: similar to word2vec, use a window around each word -> captures both syntactic (POS) and semantic information (a toy builder is sketched after this list)
- A word-document co-occurrence matrix instead captures general topics, leading to "Latent Semantic Analysis"
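A toy sketch of building a window-based co-occurrence matrix (the corpus, vocabulary, and window size are illustrative):

import numpy as np

def cooccurrence_matrix(corpus, vocab, window=2):
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and w in idx and sent[j] in idx:
                    X[idx[w], idx[sent[j]]] += 1  # count each (word, context) pair
    return X

X = cooccurrence_matrix([['i', 'like', 'deep', 'learning'], ['i', 'like', 'nlp']],
                        vocab=['i', 'like', 'deep', 'learning', 'nlp'])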
Problem: count vectors increase in size with the vocabulary, so they are very high-dimensional and sparse, and models built on them are less robust.
Solution: low-dimensional vectors
Idea: store most of the important information in a fixed, small number of dimensions, i.e. a dense vector (a dimensionality-reduction sketch follows)
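One classic way to get such low-dimensional vectors is a truncated SVD of the co-occurrence matrix X, as in LSA (the method and the stand-in data below are assumptions; the notes do not name a specific technique):

import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(8, 8)).astype(float)  # stand-in co-occurrence counts
X = X + X.T                                      # symmetric, like a window-based X

U, s, Vt = np.linalg.svd(X)
k = 2
word_vectors = U[:, :k] * s[:k]  # each row: a k-dimensional dense word vector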
Problem: function words (the, he, has) are too frequent -> syntax has too much impact. Some fixes:
- Cap the counts: min(X, t), with t ≈ 100, or ignore the function words entirely
- Use Pearson correlations instead of counts, then set negative values to 0
GloVe insight: ratios of co-occurrence probabilities can encode meaning components. For example, P(x | ice) / P(x | steam) is large for x = solid, small for x = gas, and close to 1 for a neutral x like water, so the ratio isolates what distinguishes ice from steam. (A sketch of the GloVe objective follows the list below.)
- Fast training
- Scalable to huge corpora
- Good performance even with small corpus and small vectors
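A toy sketch of the GloVe weighted least-squares objective from the GloVe paper (matrix and parameter names here are illustrative):

import numpy as np

def glove_loss(W, W_tilde, b, b_tilde, X, x_max=100.0, alpha=0.75):
    f = np.minimum((X / x_max) ** alpha, 1.0)    # cap the weight of frequent pairs
    pred = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    mask = X > 0                                 # only observed pairs contribute
    err = pred - np.log(np.where(mask, X, 1.0))  # avoid log(0) for unseen pairs
    return np.sum(f * mask * err ** 2)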
Intrinsic evaluation:
- Evaluation on a specific/intermediate subtask
- Fast to compute
- Helps to understand that system
- Not clear if really helpful unless a correlation to a real task is established
- Example: word-vector analogies, scored by cosine distance (see the sketch after this list)
- Example: word-vector distances and their correlation with human judgments
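A tiny sketch of the cosine measure used in these intrinsic evaluations (pure numpy; illustrative):

import numpy as np

def cosine_similarity(u, v):
    # 1 means same direction, 0 orthogonal; cosine distance = 1 - similarity
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707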
Extrinsic evaluation:
- Evaluation on a real task
- Can take a long time to compute accuracy
- Unclear whether the subsystem is the problem, or its interaction with other subsystems
- If replacing exactly one subsystem with another improves accuracy -> winning!
Word senses: most words have lots of meanings, especially common words.