Parent class: Module

Derived classes: BatchNorm1D, BatchNorm2D, BatchNorm3D

General information

This module performs the N-dimensional batch normalization operation. The choice of the dimension of the operation depends on the dimension of the input data.

The batch normalization layer is intended, first of all, to solve the problem of covariate shift. The easiest way to understand the covariate shift is with an example: suppose there is a network that must recognize images of cats. The training sample contains images of only black cats, so when we try to process pictures of cats of colors other than black during the tests, the quality of the prediction of the model will be noticeably worse than on a black cat set. In other words, the covariate shift is a situation when the distributions of features in the training and test sets have different parameters (mean, variance, etc.).

When we talk about the covariate shift within the framework of deep learning, in particular, we mean the situation of different distribution of features not at the network input, as in the example above, but in the layers inside the model - the internal covariate shift. A neural network changes its weights with each mini-batch passed (if we use the appropriate optimization mechanism, of course), and since the outputs of the current layer are input features for the next one, each layer in the network gets into a situation where the distribution of input features changes every step, i.e. for every passed mini-batch.

The basic idea of batch normalization is to limit the internal covariate shift by normalizing the output of each layer, transforming them into distributions with zero mean and unit variance.

Figure 1 shows that the batch normalization gets the average and variance for the batch.

Виды нормализации
Figure 1. Demonstration of the principles of operation of various types of normalization

Let us consider the case of batch normalization of two-dimensional maps. Then the data tensor has the shape (N, C, H, W), where N - batch size, C - number of maps (channels), H - height of the map, W - width of the map. Let us agree on the indexes: t - number of the batch element, i - number of the map, m - number of the feature map element in height, n - number of the feature map element in width. Then for each individual i-th feature map:

\begin{equation} \mu_i = \frac{1}{NHW}\sum_{t=1}^{N} \sum_{m=1}^{H} \sum_{m=1}^{W}x_{timn} \end{equation}
\begin{equation} \sigma_i^2 = \frac{1}{NHW}\sum_{t=1}^{N} \sum_{m=1}^{H} \sum_{m=1}^{W}(x_{timn} - \mu_i)^2 \end{equation}
\begin{equation} \hat{x}_{timn} = \frac{x_{timn} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} \end{equation}
\begin{equation} y_{timn} = \gamma\hat{x}_{timn} + \beta \end{equation}


\mu_i - mean of the distribution of features in the batch for the i-th feature map; \sigma_i^2 - mean of the distribution of features in the batch for the i-th feature map; x_{timn} - feature map element;
\hat{x}_{timn} - normalized feature map element;
\epsilon - stabilizing constant, preventing division by zero; \gamma - affine scale parameter;
\beta - affine shift parameter.

For the parameters \mu_i and \sigma_i^2 the layers of batch normalization remember the average value over the entire set during training. During inference, these parameters are frozen.

During inference, these parameters are frozen. In practice, the restriction of zero mean and unit variance can greatly limit the predictive ability of the network, so two more trained affine parameters are added: scale and shift, so that the algorithm can adjust the average and variance values for itself.

Additional sources

You can read more about batch normalization in the following sources:


def __init__(self, nd, maps, epsilon=1e-5, initFactor=1.0, minFactor=0.1, sscale=0.01, affine=True, name=None, empty=False, inplace=False):


Parameter Allowed types Descripition Default
nd int Dimension of operation -
size int Number of input features -
epsilon float Stabilizing constant 1e-5
initFactor float The initial factor value in the moving average 1.0
minFactor float The initial factor value in the moving average 0.1
sscale float Dispersion of the Gaussian distribution for the scale parameter of batch normalization 0.01
affine bool If True, the layer will have trainable affine parameters scale and bias True
name str Layer name None
empty bool If True, the tensors of the parameters of the module will not be initialized False
inplace bool If True, the output tensor will be written in memory in the place of the input tensor False


See derived classes