Automatic differentiation

Introduction

In this series of three tutorials we will look at what is hidden in the depths of the library and what logic enables the "magic" of neural network training.

Backpropagation is at the root of training modern neural networks. What does that mean? Suppose there is a model defined by a set of parameters $\theta$ (for neural networks, the model weights) and a training sample consisting of $l + 1$ "object-response" pairs $\{(x_0, y_0), \ldots, (x_l, y_l)\}$. A target function (loss function) is also defined: $J(\theta)$, or $J_i(\theta)$ for the $i$-th object of the sample.

A network with randomly initialized weights (which can mean quite different things depending on the context, since there are various weight initialization methods that directly affect training) is run on the input data $x_i$ and produces a prediction $y_i^p$, which is then compared with the true value $y_i$ (we are talking about supervised learning) using the selected loss function. Our task is therefore one of optimization: we need to select the values of the network parameters that minimize the loss function:

$\sum_{i=0}^{l} J_i(\theta) \to \min\limits_{\theta}$
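
To make the objective concrete, here is a minimal sketch in plain Python. It assumes a hypothetical toy model with two parameters, $\theta = (w, b)$, and a squared-error loss; the names `predict`, `sample_loss` and `total_loss` are illustrative and are not part of PuzzleLib.

```python
# Hypothetical toy model with two parameters: theta = (w, b).
def predict(theta, x):
    """Forward pass: the predicted response y_i^p for the input x_i."""
    w, b = theta
    return w * x + b


def sample_loss(theta, x, y):
    """Squared error J_i(theta) for one "object-response" pair (assumed loss)."""
    return (predict(theta, x) - y) ** 2


def total_loss(theta, samples):
    """Sum of J_i(theta) over the whole training sample."""
    return sum(sample_loss(theta, x, y) for x, y in samples)


# Training sample generated by y = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

print(total_loss((0.0, 0.0), samples))  # poorly chosen parameters: 1 + 9 + 25 = 35.0
print(total_loss((2.0, 1.0), samples))  # parameters that fit the data exactly: 0.0
```

Training amounts to finding the $\theta$ for which `total_loss` is as small as possible.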

Gradient descent is a powerful mechanism for solving this optimization problem. It works as follows: we calculate the gradient of the loss function. Since the gradient points in the direction of the steepest growth of the function, we need to move across the surface of the multidimensional function of the parameters against the gradient in order to approach a minimum of the function:

$\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{l + 1}\sum_{i=0}^{l} \nabla_{\theta}{J_i(\theta_t)}$

where

$\theta_{t+1}$ - an updated set of parameters for the next optimization step;
$\theta_t$ - a set of parameters in the current step;
$\eta$ - the learning rate;
$\nabla_{\theta}J_i(\theta_t)$ - the gradient of the loss function for the $i$ -th object of the training sample.
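
The update rule can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how PuzzleLib computes gradients: the model is a hypothetical two-parameter one, $\theta = (w, b)$, the loss is the squared error, and its gradient is derived by hand rather than by automatic differentiation. The function names, learning rate and step count are chosen for the example only.

```python
def gradient(theta, x, y):
    """Hand-derived gradient of J_i(theta) = (w*x + b - y)^2 with respect to (w, b)."""
    w, b = theta
    error = w * x + b - y
    return (2.0 * error * x, 2.0 * error)


def gradient_descent(samples, theta, learning_rate=0.1, steps=1000):
    """Repeatedly step against the mean gradient over the training sample."""
    n = len(samples)
    for _ in range(steps):
        grads = [gradient(theta, x, y) for x, y in samples]
        # Average the per-object gradients, component by component.
        mean_grad = tuple(sum(g[k] for g in grads) / n for k in range(2))
        # theta_{t+1} = theta_t - eta * mean gradient
        theta = tuple(p - learning_rate * g for p, g in zip(theta, mean_grad))
    return theta


# Training sample generated by y = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = gradient_descent(samples, theta=(0.0, 0.0))
print(w, b)  # approaches w = 2, b = 1
```

Writing `gradient` by hand is feasible for two parameters; for a real network with many layers it is not, which is the problem automatic differentiation solves.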

Of course, there are many subtleties and pitfalls, but we will not dwell on them here, since we need to focus on the general principle. But how does the gradient of the loss function reach not only the parameters of the output layer, but also those of the hidden layers and the input layer (which is exactly what the backpropagation algorithm takes care of)? How do you differentiate a function of millions of parameters and not lose your mind in the process? How is this implemented in PuzzleLib? The answers to these questions are in these three tutorials.

We will study a two-dimensional case for simplicity.