
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

2024-02-29


Keywords: #Batch Normalization


0. Abstract

  • Training DNNs is complicated because the distribution of each layer’s inputs changes during training. → Requires lower learning rates and careful param initialization, and makes it notoriously hard to train models with saturating nonlinearities. → internal covariate shift
  • Batch Normalization: By performing normalization for each mini-batch, we can use much higher lr and be less careful about initialization. It also acts as a regularizer.
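
A minimal NumPy sketch of the per-mini-batch transform described above (training mode only; inference-time statistics and backprop are omitted, and `eps` is an assumed small stabilizer):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform for a mini-batch x of shape (m, d):
    normalize each of the d activations over the m examples, then scale
    and shift with the learned parameters gamma and beta."""
    mu = x.mean(axis=0)                    # per-activation mini-batch mean
    var = x.var(axis=0)                    # per-activation mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # y = gamma * x_hat + beta

# Activations with a shifted, scaled distribution get renormalized:
x = np.random.randn(32, 4) * 3.0 + 5.0
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.var(axis=0))  # ≈ 0 and ≈ 1 per activation
```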

1. Introduction

  • Stochastic gradient descent (SGD): At each step we consider a mini-batch of size $m$, used to approximate the gradient of the loss function
  • Using mini-batches, as opposed to one example at a time, is helpful (see the sketch after this list):
    • The gradient of the loss over a mini-batch $\approx$ the gradient over the training set; the quality of the approximation improves as the batch size increases
    • Computation over a batch is more efficient than $m$ individual computations, due to parallelism
  • Limitations of SGD (or perhaps of any DNN model): Very sensitive to lr and initialization. Inputs to each layer are affected by the params of all preceding layers → Small changes to the network params amplify as the network becomes deeper. → The change in the distribution of layers’ inputs presents a problem
  • General idea: Learning a system as a whole is the same as training cascaded sub-networks. Therefore, the input properties that make training more efficient – such as having the same distribution between the training and test data – apply to training the sub-networks as well.
    • “Mitigate training burden”: If the input distribution remains fixed, then the params do not have to readjust to compensate for the change in the input distribution.
    • “Training Acceleration”: If we can ensure that the distribution of nonlinearity inputs remains more stable, then the optimizer would be less likely to get stuck in the saturated regime.
  • Proposal: Batch Normalization to eliminate the change in the distributions of internal nodes of a deep network, referred to as Internal Covariate Shift
  • Contributions
    • Reduces internal covariate shift
    • Reduces the dependence of gradients on the scale of params or their initial values. → higher lr without the risk of divergence → faster training
    • Regularization effects: reduces the need for Dropout
    • Makes it possible to use saturating nonlinearities
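
To make the mini-batch gradient approximation concrete, a hypothetical toy example (linear least squares; not from the paper) comparing the mini-batch gradient to the full training-set gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: linear least squares, loss(w) = mean((Xw - y)^2) / 2
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
w = np.zeros(5)

def grad(Xb, yb, w):
    # Gradient of the loss averaged over the given (mini-)batch.
    return Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)                         # gradient over the training set
for m in (1, 10, 100):                       # mini-batch sizes
    idx = rng.choice(1000, size=m, replace=False)
    approx = grad(X[idx], y[idx], w)
    print(m, np.linalg.norm(approx - full))  # error typically shrinks as m grows
```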

2. Towards Reducing Internal Covariate Shift

  • Network training converges faster when inputs are whitened (linearly transformed to have zero mean and unit variance, and decorrelated).
  • By fixing & whitening the distribution of every layer input, we expect to improve training speed.
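
The paper does not prescribe a particular whitening procedure; as one common illustration, a minimal ZCA-whitening sketch in NumPy (`eps` is an assumed stabilizer):

```python
import numpy as np

def whiten(x, eps=1e-5):
    """ZCA-whiten a batch x of shape (m, d): subtract the mean, then
    rotate/rescale so the features are decorrelated with unit variance."""
    xc = x - x.mean(axis=0)                        # zero mean
    cov = xc.T @ xc / len(xc)                      # feature covariance, (d, d)
    u, s, _ = np.linalg.svd(cov)                   # cov = u @ diag(s) @ u.T
    W = u @ np.diag(1.0 / np.sqrt(s + eps)) @ u.T  # inverse square root of cov
    return xc @ W                                  # output covariance ≈ identity

# Correlated inputs become decorrelated with unit variance:
x = np.random.randn(256, 3) @ np.array([[2.0, 1.0, 0.0],
                                        [0.0, 1.0, 0.5],
                                        [0.0, 0.0, 1.0]])
print(np.cov(whiten(x).T))  # close to the 3x3 identity
```

Doing this for every layer at every training step is expensive (it requires the covariance matrix, its inverse square root, and backprop through both), which is part of what motivates the cheaper per-dimension normalization of Batch Normalization.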