Stanford CS224n Natural Language Processing

Lecture 3 - Neural Networks

Classification review/introduction

Generally we have a training dataset consisting of samples {x_i, y_i}, i = 1, ..., N.

x_i are inputs, e.g. words, sentences, documents, etc.

y_i are the labels we try to predict, for example:

  • classes: sentiment, named entities, buy/sell decision
  • other words
  • later: multi-word sequences

Traditional ML/statistics approach: assume the x_i are fixed, and train softmax/logistic regression weights W in R^{C x d} to determine a decision boundary, as in the picture.

Method: Training with softmax and cross-entropy loss

For each x, predict:

p(y | x) = exp(W_y x) / sum_{c=1}^{C} exp(W_c x)

For each training example (x, y), the objective is to maximize the probability of the correct class y, or equivalently to minimize the negative log probability of that class: -log p(y | x).

FYI: cross-entropy loss over the full dataset {x_i, y_i}, i = 1, ..., N, is

J(theta) = (1/N) sum_{i=1}^{N} -log p(y_i | x_i)
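The softmax prediction and cross-entropy loss above can be sketched in a few lines of NumPy (toy sizes d = 4, C = 3; all variable names here are illustrative, not from the lecture):

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, C = 4, 3
W = rng.normal(size=(C, d))      # weight matrix W in R^{C x d}
x = rng.normal(size=d)           # one input vector
y = 1                            # index of the correct class

p = softmax(W @ x)               # p(y | x) for every class
loss = -np.log(p[y])             # negative log probability of the true class
```

Averaging `loss` over all N training examples gives the full-dataset cross-entropy J(theta).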

Neural networks introduction

Commonly in NLP deep learning:

  • We learn both W and word vectors X
  • We learn both conventional parameters and representations
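The point about learning both kinds of parameters can be made concrete with one manual SGD step: the gradient of the softmax cross-entropy loss flows into the classifier weights W and into the word vector itself. A minimal sketch with toy sizes (all names hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(1)
V, d, C = 10, 4, 3                   # vocab size, vector dim, num classes
embeddings = rng.normal(size=(V, d)) # word vectors: also learned
W = rng.normal(size=(C, d))          # conventional softmax parameters
lr = 0.05

word, y = 7, 2                       # one training example: word index, gold class
x = embeddings[word]
p = softmax(W @ x)
loss_before = -np.log(p[y])

# For softmax + cross-entropy, d(loss)/d(logits) = p - onehot(y).
dlogits = p.copy()
dlogits[y] -= 1.0
dW = np.outer(dlogits, x)            # gradient w.r.t. the classifier weights
dx = W.T @ dlogits                   # gradient w.r.t. the input word vector

W -= lr * dW                         # update the conventional parameters...
embeddings[word] -= lr * dx          # ...and the representation itself

loss_after = -np.log(softmax(W @ embeddings[word])[y])
```

After the step, the loss on this example drops, because both the decision boundary and the word's representation moved toward the correct class.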

Named Entity Recognition

The task: find and classify names in text.

NER on word sequences

We predict entities by classifying words in context and then extracting entities as word subsequences.

BIO encoding
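In the BIO scheme, B-X marks the first token of an entity of class X, I-X marks a continuation, and O marks a token outside any entity. A sketch of encoding entity spans as BIO tags and decoding them back (helper names are illustrative, not from the lecture):

```python
def spans_to_bio(tokens, spans):
    """spans: list of (start, end_exclusive, label) over token indices."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label           # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label           # continuation tokens
    return tags

def bio_to_spans(tags):
    """Recover (start, end_exclusive, label) spans from BIO tags."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):             # a new entity begins
            if start is not None:
                spans.append((start, i, label))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == label:
            continue                         # entity continues
        else:                                # O tag or inconsistent I- tag
            if start is not None:
                spans.append((start, i, label))
            start, label = None, None
    if start is not None:                    # entity runs to the last token
        spans.append((start, len(tags), label))
    return spans

tokens = ["Sophia", "Loren", "lives", "in", "Rome"]
tags = spans_to_bio(tokens, [(0, 2, "PER"), (4, 5, "LOC")])
```

Encoding spans this way turns NER into per-word classification, and decoding the predicted tags extracts entities as word subsequences, as described above.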

Why might NER be hard?

  • hard to work out the boundaries of an entity
  • hard to know if something is an entity at all
  • hard to know the class of an unknown/novel entity
  • entity class is ambiguous and depends on context

Binary true vs. corrupted word window classification

Binary word window classification

Problem: a word in isolation is often ambiguous; its class depends on context.

Idea: classify a word in its context window of neighboring words.

Solution: train a softmax classifier to classify a center word by taking the concatenation of the word vectors surrounding it in a window.