Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
2024-02-29
Keywords: #Batch Normalization
0. Abstract
- Training DNNs is complicated because the distribution of each layer’s inputs changes during training. → Requires lower learning rates and careful param initialization, and makes it hard to train models with saturating nonlinearities. → internal covariate shift
- Batch Normalization: By performing normalization for each mini-batch, we can use much higher lr and be less careful about initialization. It also acts as a regularizer.
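A minimal NumPy sketch of the per-mini-batch normalization described above; the `batch_norm` helper, the `gamma`/`beta` scale-and-shift parameters, and the `eps` constant are my own illustrative choices, following the transform in the paper.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x: (batch_size, features) activations of one layer.
    gamma, beta: learnable scale and shift, shape (features,).
    eps: small constant assumed here for numerical stability.
    """
    mu = x.mean(axis=0)                      # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized activations
    return gamma * x_hat + beta              # scale and shift

# Illustrative usage: a batch of 4 samples with 3 features
x = np.random.randn(4, 3) * 5 + 2
y = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0), y.var(axis=0))         # ~0 mean, ~1 variance per feature
```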
1. Introduction
- Stochastic gradient descent (SGD): At each step we consider a mini-batch of size $m$, used to approximate the gradient of the loss function
- Using mini-batches, as opposed to a single example, is helpful (see the sketch after these bullets):
- The gradient of the loss over a mini-batch $\approx$ the gradient over the full training set; the quality of this estimate improves as the batch size increases
- Computation over a batch is more efficient than $m$ individual computations, due to parallelism
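A small sketch of the first bullet: the mini-batch gradient approximates the full-training-set gradient, and the approximation improves as $m$ grows. The quadratic loss, synthetic data, and `grad` helper are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))            # illustrative training set
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=10_000)

def grad(w, Xb, yb):
    """Gradient of mean squared error over a batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(5)
full = grad(w, X, y)                         # gradient over the full training set
for m in (8, 64, 512):                       # growing mini-batch sizes
    idx = rng.choice(len(X), size=m, replace=False)
    mini = grad(w, X[idx], y[idx])
    # The approximation error typically shrinks as the batch size m grows
    print(m, np.linalg.norm(mini - full))
```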
- Limitations of SGD (really of training deep networks with it): Very sensitive to lr and initialization. Inputs to each layer are affected by the params of all preceding layers → Small changes to the network params amplify as the network becomes deeper. → Changes in the distribution of layers’ inputs present a problem
- General idea: Learning a system as a whole is the same as training a cascade of sub-networks. Therefore, the input properties that make training more efficient – such as having the same distribution between the training and test data – apply to training each sub-network as well.
- “Mitigate training burden”: If the input distribution to a sub-network remains fixed over time, then its params do not have to readjust to compensate for changes in that distribution.
- “Training Acceleration”: If we can ensure that the distribution of nonlinearity inputs remains more stable, then the optimizer would be less likely to get stuck in the saturated regime (see the sigmoid sketch below).
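To make the “saturated regime” concrete, here is a tiny numeric check (my own illustration, not from the paper) of how the sigmoid’s gradient vanishes once its input drifts away from zero:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# As |z| grows the gradient collapses, so little signal flows back
for z in (0.0, 2.0, 5.0, 10.0):
    print(z, sigmoid_grad(z))
# 0.0 -> 0.25, 2.0 -> ~0.105, 5.0 -> ~0.0066, 10.0 -> ~4.5e-05
```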
- Proposal: Batch Normalization to eliminate the change in the distributions of internal nodes of a deep network, referred to as Internal Covariate Shift
- Contributions
- Reduces internal covariate shift
- Reduces the dependence of gradients on the scale of params or their initial values. → higher lr without the risk of divergence → faster training
- Regularization effects: reduces the need for Dropout
- Makes it possible to use saturating nonlinearities
2. Towards Reducing Internal Covariate Shift
- Network training converges faster when inputs are whitened (linearly transformed to have zero mean and unit variance, and decorrelated).
- By fixing & whitening the distribution of every layer input, we expect to improve training speed.
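A rough sketch of what fully whitening a layer’s inputs would involve; the ZCA-style `whiten` helper and the `eps` regularizer are my own illustrative choices, meant only to show why per-step whitening (a covariance eigendecomposition for every layer, every step) is expensive.

```python
import numpy as np

def whiten(x, eps=1e-5):
    """ZCA-whiten a batch of layer inputs: zero mean, ~identity covariance.

    x: (batch_size, features). eps regularizes small eigenvalues.
    """
    xc = x - x.mean(axis=0)                        # zero-mean the inputs
    cov = xc.T @ xc / (len(x) - 1)                 # feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)         # symmetric eigendecomposition
    # Inverse square root of the covariance matrix
    w = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return xc @ w

# Illustrative usage: correlated inputs become decorrelated, unit-variance
x = np.random.randn(256, 4) @ np.random.randn(4, 4)
xw = whiten(x)
print(np.round(np.cov(xw, rowvar=False), 2))       # ~ identity matrix
```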