## Description

Info

Parent class: Optimizer

Derived classes: -

Regular stochastic gradient descent and its inertial variations (MomentumSGD, NesterovSGD) do not take into account the fact that some features can be extremely informative yet occur rarely (for example, in unbalanced datasets). This applies not only to the input features: equally rare features can appear in the deep representations of a convolutional network, after the input has been "digested" by several layers.

The idea of the adaptive gradient approach is as follows: each feature gets its own learning rate, chosen so that frequently occurring features receive smaller updates and rare features receive larger ones.

Before presenting this in analytical form, let us recall the formula for SGD over a batch of $l$ samples:

$$\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{l}\sum_{i=1}^{l} \nabla_{\theta}{J_i(\theta_t)}$$

To shorten the notation, we omit the averaging over the batch and denote the gradient of the loss at step $t$ by $g_t$:

$$g_t = \nabla_{\theta}{J(\theta_t)}$$

Then the update for the $i$-th parameter $\theta_i$ looks as follows:

$$\theta_{t + 1, i} = \theta_{t, i} - \eta \cdot g_{t, i}$$

For adaptive gradient descent, the running sum of squared gradients $G_t$ is introduced for each model parameter:

$$G_t = G_{t-1} + g_t^2$$

Here $G_t$ is a diagonal matrix in which the element at position $(i, i)$ is the sum of the squared gradients of the $i$-th parameter over all steps up to $t$, and $g_t^2$ denotes the element-wise square placed on the diagonal.

Let us rewrite the formula for updating the $i$-th parameter $\theta$:

$$\theta_{t + 1, i} = \theta_{t, i} - \frac {\eta}{\sqrt{G_{t, ii} + \epsilon}} \cdot g_{t, i}$$

where

$\epsilon$ - smoothing parameter that prevents division by zero (usually $10^{-8}$).
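As a quick arithmetic check of the per-parameter formula, consider a single parameter with illustrative values (not taken from the library): $\eta = 0.1$, gradients $g_{1,i} = 2$ and $g_{2,i} = 1$, an accumulator starting at zero, and $\epsilon$ neglected because it is tiny compared to the sums:

$$G_{1,ii} = 2^2 = 4, \qquad \Delta\theta_{1,i} = -\frac{0.1}{\sqrt{4}} \cdot 2 = -0.1$$

$$G_{2,ii} = 4 + 1^2 = 5, \qquad \Delta\theta_{2,i} = -\frac{0.1}{\sqrt{5}} \cdot 1 \approx -0.0447$$

Plain SGD with the same $\eta$ would have taken steps of $-0.2$ and $-0.1$, so the accumulated history already damps the second step by more than half.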

In vector form (using the element-wise product $\odot$, with the square root and division also applied element-wise to the diagonal of $G_t$):

$$\theta_{t + 1} = \theta_t - \frac {\eta}{\sqrt{G_t + \epsilon}} \odot g_t$$
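The vector-form update can be sketched in plain NumPy. This is only an illustration of the formula, not the PuzzleLib implementation: the function `adagrad_step`, its argument names, and the toy gradients are assumptions made for the example, and the diagonal of $G_t$ is stored as a vector.

```python
import numpy as np


def adagrad_step(theta, grad, accum, learn_rate=1e-3, epsilon=1e-8):
    """One AdaGrad update; `accum` holds the diagonal of G as a vector."""
    accum = accum + grad ** 2  # G_t = G_{t-1} + g_t^2 (element-wise)
    theta = theta - learn_rate / np.sqrt(accum + epsilon) * grad
    return theta, accum


theta = np.zeros(2)
accum = np.zeros(2)

# Hypothetical gradients: parameter 0 is active at every step,
# parameter 1 only at the last of the 10 steps (a "rare" feature).
for t in range(10):
    grad = np.array([1.0, 1.0 if t == 9 else 0.0])
    theta, accum = adagrad_step(theta, grad, accum, learn_rate=0.1)

# Per-parameter learning rate that the next update would use.
effective_rate = 0.1 / np.sqrt(accum + 1e-8)
print("accumulated squared gradients:", accum)
print("effective learning rates:", effective_rate)
```

After these ten steps the frequently updated parameter has accumulated a sum of 10 and the rare one a sum of 1, so the rare parameter keeps a noticeably larger effective learning rate.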

The main disadvantage of the algorithm is that the squared gradients in the denominator accumulate continuously, since every added term is non-negative. As a result, the learning rate for some features becomes so small that the algorithm can no longer explore the surface of the objective function. AdaDelta, RMSProp and Adam were proposed as solutions to this problem.
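The shrinking learning rate is easy to observe numerically. The sketch below assumes a constant gradient of 1.0 for a single parameter (a deliberately artificial case) and tracks the effective rate $\eta / \sqrt{G_t + \epsilon}$ over the steps; all names and values are chosen for illustration only.

```python
import numpy as np

learn_rate, epsilon = 0.1, 1e-8
grad = 1.0    # hypothetical constant gradient for one parameter
accum = 0.0   # running sum of squared gradients
rates = []

for t in range(1, 10001):
    accum += grad ** 2
    rates.append(learn_rate / np.sqrt(accum + epsilon))

for t in (1, 100, 10000):
    print("step %d: effective learning rate %.6f" % (t, rates[t - 1]))
```

With a constant gradient the effective rate decays proportionally to $1/\sqrt{t}$: approximately 0.1 at the first step, 0.01 after 100 steps, and 0.001 after 10000 steps.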

## Initializing

def __init__(self, learnRate=1e-3, epsilon=1e-8, nodeinfo=None):


Parameters

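| Parameter | Allowed types | Description | Default |
|-----------|---------------|-------------|---------|
| learnRate | float | Learning rate | 1e-3 |
| epsilon | float | Smoothing parameter that prevents division by zero | 1e-8 |
| nodeinfo | NodeInfo | Object containing information about the computational node | None |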

Explanations

-

## Examples

Necessary imports:

import numpy as np
from PuzzleLib.Optimizers.AdaGrad import AdaGrad
from PuzzleLib.Backend import gpuarray


Info

gpuarray is required to place the tensors on the GPU.

Let us set up a synthetic training dataset:

data = gpuarray.to_gpu(np.random.randn(16, 128).astype(np.float32))
target = gpuarray.to_gpu(np.random.randn(16, 1).astype(np.float32))


Declaring the optimizer:

optimizer = AdaGrad(learnRate=0.01)


Suppose a network `net` has already been defined, for example through Graph. To install the optimizer on the network, we need the following:

optimizer.setupOn(net, useGlobalState=True)


Info

You can read more about optimizer methods and their parameters in the description of the Optimizer parent class.

Moreover, suppose there is a loss function `loss`, inherited from Cost, that computes both the error and its gradient. Then the optimization process is implemented as follows:

for i in range(100):
... predictions = net(data)
... error, grad = loss(predictions, target)