# Course 4 - Backpropagation and computation graphs

## Matrix gradients for our simple neural net and some tips

Chain rule:
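For the simple net used in these lectures (assuming the usual setup of a score s = u^T h with h = f(z) and z = Wx + b), the chain rule says that derivatives of composed functions multiply:

$$
\frac{\partial s}{\partial W}
= \frac{\partial s}{\partial h}\,
  \frac{\partial h}{\partial z}\,
  \frac{\partial z}{\partial W}
$$

Each factor is a local (Jacobian) gradient; backpropagation is just the systematic, reuse-friendly evaluation of these products.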

**Question** Should I use available “pre-trained” word vectors?

**Answer** Almost always. They are trained on far more data than you will typically have for your task.

**Question** Should I update ("fine-tune") my own word vectors?

**Answer**

- If you only have a **small** training dataset, **don't** train (fine-tune) the word vectors.
- If you have a **large** dataset, it probably will work better to **train = update = fine-tune** the word vectors to the task.

## Computation graphs and backpropagation

Each node in a computation graph can be implemented as a "gate" object with a `forward` method (compute the output from the inputs) and a `backward` method (multiply the upstream gradient by the local gradient), e.g. a `MultiplyGate`, sketched below.
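A minimal runnable sketch of such a gate, assuming the usual forward/backward interface; caching the inputs on `self` and the exact signatures are illustrative choices, not a prescribed API:

```python
class MultiplyGate(object):
    """One node of a computation graph computing z = x * y."""

    def forward(self, x, y):
        # Cache the inputs: they are the local gradients needed in backward.
        self.x = x
        self.y = y
        return x * y

    def backward(self, dz):
        # Chain rule at this node: downstream grad = upstream grad * local grad.
        dx = self.y * dz   # d(x*y)/dx = y
        dy = self.x * dz   # d(x*y)/dy = x
        return [dx, dy]


gate = MultiplyGate()
z = gate.forward(3.0, 4.0)     # forward pass: z = 12.0
dx, dy = gate.backward(1.0)    # backward pass: dx = 4.0, dy = 3.0
```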

## Stuff you should know

### Regularization to prevent overfitting

In practice, a full loss function includes a regularization term over all parameters; this prevents **overfitting** when we have a lot of features (or, later, very powerful deep models).
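Concretely, assuming a cross-entropy classification loss (a standard example, not spelled out here), L2 regularization adds a penalty on all parameters θ:

$$
J(\theta) = \frac{1}{N}\sum_{i=1}^{N} -\log\frac{e^{f_{y_i}}}{\sum_{c} e^{f_{c}}}
\;+\; \lambda \sum_{k} \theta_k^{2}
$$

The penalty keeps weights small unless they really help reduce the data loss; a larger λ means stronger regularization.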

### Vectorization
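The point here (presumably, from the heading) is to favor whole-matrix operations over Python loops. A sketch of the kind of timing comparison usually shown, with illustrative shapes (a 5 x 300 weight matrix applied to 100 word vectors of dimension 300):

```python
import numpy as np
from timeit import timeit

W = np.random.rand(5, 300)                               # illustrative weight matrix
vectors = [np.random.rand(300, 1) for _ in range(100)]   # 100 word vectors
batch = np.hstack(vectors)                               # same data as one 300 x 100 matrix

def looped():
    # One matrix-vector product per word vector (slow in Python).
    return [W.dot(v) for v in vectors]

def vectorized():
    # A single matrix-matrix product over all word vectors at once.
    return W.dot(batch)

print("looped:    ", timeit(looped, number=1000))
print("vectorized:", timeit(vectorized, number=1000))
```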

### Nonlinearities

- logistic ("sigmoid"): sigmoid(z) = 1 / (1 + e^(-z)), squashes values into (0, 1)
- tanh: tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z)) = 2 * sigmoid(2z) - 1, squashes values into (-1, 1)
- hard tanh: -1 for z < -1, z for -1 <= z <= 1, 1 for z > 1; a cheap piecewise-linear approximation of tanh
- ReLU: rect(z) = max(z, 0)
- Leaky ReLU / Parametric ReLU: max(z, a*z) with a small slope a (e.g. 0.01) on the negative side; Parametric ReLU learns a
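A quick numpy sketch of these functions (np.tanh is built in; the leak slope a = 0.01 is just a common illustrative choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # logistic, output in (0, 1)

def hard_tanh(z):
    return np.clip(z, -1.0, 1.0)      # -1 below -1, z in between, 1 above 1

def relu(z):
    return np.maximum(z, 0.0)         # max(z, 0)

def leaky_relu(z, a=0.01):
    # Parametric ReLU has the same form, but the slope a is learned.
    return np.maximum(z, a * z)

z = np.linspace(-2.0, 2.0, 5)
print(sigmoid(z), np.tanh(z), hard_tanh(z), relu(z), leaky_relu(z), sep="\n")
```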

### Initialization

Xavier (Glorot) initialization sets the variance inversely proportional to the fan-in n_in and fan-out n_out: Var(W_i) = 2 / (n_in + n_out).
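A sketch of the uniform variant under that variance target (the function name `xavier_uniform` and the shapes are illustrative):

```python
import numpy as np

def xavier_uniform(n_in, n_out):
    # Uniform Xavier/Glorot initialization: target variance = 2 / (n_in + n_out).
    # A uniform distribution U(-r, r) has variance r**2 / 3, so solve for r.
    r = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-r, r, size=(n_out, n_in))

W = xavier_uniform(300, 100)
print(W.std() ** 2, 2.0 / (300 + 100))  # empirical variance vs target
```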

### Optimizers

SGD / Adagrad / RMSprop / Adam / SparseAdam

Plain SGD usually works fine with a well-tuned learning rate; adaptive optimizers that scale the update per parameter (Adagrad, RMSprop, Adam) are often a safer default for more complex nets.
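As an illustration (assuming PyTorch; the placeholder model, data, and hyperparameters are not from the notes), swapping optimizers is a one-line change:

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(300, 5)  # placeholder model

# Plain SGD needs a carefully tuned learning rate ...
optimizer = optim.SGD(model.parameters(), lr=0.01)

# ... while adaptive optimizers are more forgiving about the initial rate:
# optimizer = optim.Adagrad(model.parameters(), lr=0.01)
# optimizer = optim.RMSprop(model.parameters(), lr=0.001)
# optimizer = optim.Adam(model.parameters(), lr=0.001)

loss = model(torch.randn(8, 300)).pow(2).mean()  # dummy loss, just to take one step
optimizer.zero_grad()
loss.backward()
optimizer.step()
```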

### Learning rates

- You can just use a constant learning rate.
- Better results can generally be obtained by letting the learning rate decrease as you train:
  - By hand: halve the learning rate every k epochs.
  - By a formula, e.g. lr = lr_0 * e^(-k t) for epoch t.
  - Fancier methods like cyclic learning rates (q.v.).
- Fancier optimizers still use a learning rate, but it may be an initial rate that the optimizer shrinks over time, so you may be able to start high.
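A minimal sketch of the two decay schedules above (the constants lr_0 = 0.1, k, and the halving interval are illustrative):

```python
import math

lr0 = 0.1  # illustrative initial learning rate

def halve_every_k_epochs(epoch, k=10):
    # "By hand": cut the learning rate in half every k epochs.
    return lr0 * (0.5 ** (epoch // k))

def exponential_decay(epoch, k=0.1):
    # "By a formula": lr = lr_0 * exp(-k * t) for epoch t.
    return lr0 * math.exp(-k * epoch)

for epoch in (0, 10, 20, 30):
    print(epoch, halve_every_k_epochs(epoch), exponential_decay(epoch))
```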