Generally we have a training dataset consisting of samples {x_i, y_i}_{i=1}^{N}
x_i are inputs, e.g., words, sentences, documents, etc.
y_i are the labels we try to predict, for example:
- classes: sentiment, named entities, buy/sell decision
- other words
- later: multi-word sequences
Traditional ML/stats approach: assume the x_i are fixed, and train softmax/logistic regression weights W ∈ ℝ^{C×d} to determine a decision boundary (hyperplane), as in the picture.
For each x, predict:
p(y|x) = exp(W_y · x) / Σ_{c=1}^{C} exp(W_c · x)
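A minimal NumPy sketch of this prediction rule (the weights W and input x below are random toy stand-ins, not trained values):

```python
import numpy as np

def softmax_predict(W, x):
    """Class probabilities p(y|x) for a single input x, with W of shape (C, d)."""
    logits = W @ x                 # score W_c . x for each class c
    logits -= logits.max()         # shift for numerical stability; probabilities unchanged
    exp_scores = np.exp(logits)
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))        # toy weights: C=3 classes, d=5 features
x = rng.normal(size=5)
p = softmax_predict(W, x)
print(p, p.sum())                  # a distribution over the 3 classes, sums to 1
```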
For each training example (x, y), our objective is to maximize the probability of the correct class y, or equivalently, to minimize the negative log probability of that class: −log p(y|x).
FYI: cross entropy over the full dataset is
J(θ) = (1/N) Σ_{i=1}^{N} −log p(y_i | x_i)
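A sketch of that dataset-level loss in NumPy, assuming X stacks the N inputs as rows and y holds integer class labels:

```python
import numpy as np

def dataset_cross_entropy(W, X, y):
    """J(theta): average -log p(y_i | x_i) over the dataset.
    W: (C, d) weights, X: (N, d) inputs, y: (N,) integer labels."""
    logits = X @ W.T                                  # (N, C) class scores
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()    # pick the true-class log prob per row
```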
Commonly in NLP deep learning:
- We learn both W and the word vectors x
- We learn both conventional parameters and representations, so the number of parameters grows with the vocabulary
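A minimal PyTorch sketch of this joint setup; the sizes V, d, C, the toy batch, and the single SGD step are illustrative assumptions, not a full training loop:

```python
import torch
import torch.nn as nn

V, d, C = 10_000, 50, 3            # assumed vocabulary, embedding, and class sizes

emb = nn.Embedding(V, d)           # the word vectors: learned, not fixed
clf = nn.Linear(d, C)              # the conventional softmax weights W (plus a bias)
opt = torch.optim.SGD(list(emb.parameters()) + list(clf.parameters()), lr=0.1)

word_ids = torch.tensor([4, 17, 512])    # toy batch of word indices
labels = torch.tensor([0, 2, 1])         # toy class labels

logits = clf(emb(word_ids))              # one forward pass through both parameter sets
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                          # gradients reach W and the word vectors alike
opt.step()                               # both are updated together
```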
The task (Named Entity Recognition, NER): find and classify names in text.
We predict entities by classifying words in context and then extracting entities as word subsequences.
Why might NER be hard?
- hard to work out the boundaries of an entity
- hard to know if something is an entity at all
- hard to know the class of an unknown/novel entity
- entity class is ambiguous and depends on context
Idea: classify a word in its context window of neighboring words.
Solution: train a softmax classifier to classify the center word by taking the concatenation of the word vectors surrounding it in a window, as in the sketch below.
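A minimal PyTorch sketch of such a window classifier; the sizes, window width, and toy sentence are assumptions for illustration:

```python
import torch
import torch.nn as nn

V, d, C, window = 10_000, 50, 5, 2         # assumed sizes; window = 2 words per side

emb = nn.Embedding(V, d)
clf = nn.Linear((2 * window + 1) * d, C)   # softmax weights over the concatenated window

sent = torch.tensor([31, 7, 1208, 44, 2])  # toy window of 2*window + 1 word indices
x_window = emb(sent).reshape(-1)           # concatenate the 5 word vectors: shape (5*d,)
logits = clf(x_window)
probs = torch.softmax(logits, dim=0)       # p(class | center word in its context)
print(probs)
```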