Stanford CS224n Natural Language Processing, Course 4

Course 4 - Backpropagation and computation graphs

Matrix gradients for our simple neural net and some tips

Chain rule:
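
A minimal sketch, assuming the standard setup of a layer z = Wx + b, h = f(z), and a score s = uᵀh (these names are assumptions, not taken from the notes): the gradients along a path multiply.

\frac{\partial s}{\partial b}
  = \frac{\partial s}{\partial h}\,
    \frac{\partial h}{\partial z}\,
    \frac{\partial z}{\partial b}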

Question: Should I use available “pre-trained” word vectors?

Answer: Almost always, yes.

Question: Should I update (“fine-tune”) my own word vectors?

Answer:

  • If you only have a small training data set, don’t train (fine-tune) the word vectors.
  • If you have a large dataset, it will probably work better to train = update = fine-tune the word vectors to the task.

Computation graphs and backpropagation

class MultiplyGate(object):
    def forward(self, x, y):
        z = x * y
        # cache the inputs; they are needed in the backward pass
        self.x = x
        self.y = y
        return z

    def backward(self, dz):
        # chain rule: local gradient times upstream gradient dz
        dx = self.y * dz
        dy = self.x * dz
        return [dx, dy]
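
A minimal usage sketch (the numbers are illustrative): the forward pass computes and caches, the backward pass multiplies the upstream gradient by each local gradient.

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)     # z = -12.0; inputs are cached
dx, dy = gate.backward(2.0)     # upstream gradient dz = 2.0
# dx = y * dz = -8.0,  dy = x * dz = 6.0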

Stuff you should know

Regularization to prevent overfitting

A regularization term (e.g., an L2 penalty over all parameters) added to the loss largely prevents overfitting when we have a lot of features or a very powerful/deep model
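
For example, a cross-entropy loss with an L2 penalty over all parameters θ, where λ is the regularization strength (this standard form is assumed here):

J(\theta) = \frac{1}{N}\sum_{i=1}^{N}
    -\log\frac{e^{f_{y_i}}}{\sum_{c} e^{f_c}}
  \;+\; \lambda \sum_{k} \theta_k^{2}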

Vectorization
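
The point here is that one big matrix multiply beats a Python loop over individual word vectors; a rough timing sketch (matrix sizes and names are illustrative, not from the notes):

import numpy as np
from timeit import timeit

W = np.random.rand(5, 300)           # e.g. a weight matrix
wordvecs = np.random.rand(300, 500)  # 500 word vectors of dimension 300

def with_loop():
    # multiply one word vector at a time: slow
    return [W.dot(wordvecs[:, i]) for i in range(wordvecs.shape[1])]

def vectorized():
    # one matrix-matrix multiply: fast
    return W.dot(wordvecs)

print(timeit(with_loop, number=100))
print(timeit(vectorized, number=100))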

Nonlinearities

  • logistic (“sigmoid”)
  • tanh
  • hard tanh
  • ReLU
  • Leaky ReLU / Parametric ReLU
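
A minimal NumPy sketch of these nonlinearities (the Leaky-ReLU slope a = 0.01 is an illustrative default; in Parametric ReLU it is learned):

import numpy as np

def logistic(z):            # "sigmoid", maps to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                # rescaled sigmoid, maps to (-1, 1)
    return np.tanh(z)

def hard_tanh(z):           # cheap piecewise-linear approximation of tanh
    return np.clip(z, -1.0, 1.0)

def relu(z):                # rectified linear unit: max(z, 0)
    return np.maximum(z, 0.0)

def leaky_relu(z, a=0.01):  # small slope a for z < 0
    return np.where(z > 0, z, a * z)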

Initialization

Xavier initialization has variance inversely proportional to fan-in n_in and fan-out n_out: Var(W_i) = 2 / (n_in + n_out)
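
A minimal NumPy sketch of this rule (layer sizes are illustrative):

import numpy as np

def xavier_init(n_in, n_out):
    # Var(W_ij) = 2 / (n_in + n_out), zero mean
    std = np.sqrt(2.0 / (n_in + n_out))
    return np.random.randn(n_out, n_in) * std

W = xavier_init(300, 100)   # e.g. a 300-dim input feeding 100 hidden units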

Optimizers

SGD / Adagrad / RMSprop / Adam / SparseAdam
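
All of these are available as PyTorch optimizers in torch.optim; a minimal usage sketch (the model, data, and learning rate are placeholders):

import torch

model = torch.nn.Linear(300, 5)                            # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 300)                                   # placeholder batch
loss = model(x).pow(2).mean()                              # placeholder loss

optimizer.zero_grad()   # clear old gradients
loss.backward()         # backpropagate through the computation graph
optimizer.step()        # one parameter update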

Learning rates

  • You can just use a constant learning rate

  • Better results can generally be obtained by allowing learning rates to decrease as you train

    • By hand: halve the learning rate every k epochs

    • By a formula, e.g. lr = lr₀·e^(−kt) for epoch t (see the sketch after this list)

    • Fancier methods like cyclic learning rates (q.v.)

  • Fancier optimizers still use a learning rate, but it may be an initial rate that the optimizer shrinks over time, so you may be able to start it high
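
A minimal sketch of the two decay schedules above (lr0, k, and the decay constant are illustrative placeholders):

import math

lr0 = 0.1      # initial learning rate
k = 10         # "halve every k epochs"
decay = 0.05   # decay constant for the formula-based schedule

for epoch in range(50):
    # by hand: halve the learning rate every k epochs
    lr_halved = lr0 * (0.5 ** (epoch // k))
    # by a formula: lr = lr0 * e^(-decay * t) for epoch t
    lr_decayed = lr0 * math.exp(-decay * epoch)
    # ... run one training epoch with the chosen learning rate ...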