## Description¶

Info

Parent class: Optimizer

Derived classes: -

To implement the algorithm, exponentially decaying moving averages of the gradients of the objective function and of their squares are maintained:

$$m_t = \beta_1{m_{t-1}} + (1 - \beta_1)g_t$$

$$\upsilon_t = \beta_2{\upsilon_{t-1}} + (1 - \beta_2)g_t^2$$

where

$m_t$ - estimate of the first moment (the mean of the gradients);
$\upsilon_t$ - estimate of the second moment (the uncentered variance of the gradients).

However, with this definition the estimates $m_t$ and $\upsilon_t$ are biased toward zero during the first steps of the algorithm, especially when the decay rates $\beta_1$ and $\beta_2$ are close to $1$.

To eliminate this bias without introducing new hyperparameters, bias-corrected estimates of the first and second moments are used:

$$\hat{m_t} = \frac{m_t}{1 - \beta_1^t}$$

$$\hat{\upsilon_t} = \frac{\upsilon_t}{1 - \beta_2^t}$$
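The effect of the correction is easy to see by feeding a constant gradient into the recursions (a standalone Python sketch, not PuzzleLib code): the raw estimate $m_t$ starts far below the true mean of the gradients, while the corrected $\hat{m_t}$ recovers it exactly.

```python
# Standalone illustration of the bias correction (not part of PuzzleLib).
# With a constant gradient g = 1, the true mean of the gradients is 1,
# but the raw exponential moving average m_t starts near zero.
beta1 = 0.9
g, m = 1.0, 0.0

for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g         # raw first-moment estimate
    m_hat = m / (1 - beta1 ** t)            # bias-corrected estimate
    print(t, round(m, 3), round(m_hat, 3))  # m = 0.1, 0.19, 0.271; m_hat = 1.0 each step
```

The same correction applies to $\upsilon_t$ with $\beta_2$ in place of $\beta_1$.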

Then the expression for parameter updating is:

$$\theta_{t + 1} = \theta_t - \frac {\eta}{\sqrt{\hat{\upsilon_t}} + \epsilon} \hat{m_t}$$

The authors of the algorithm recommend using the following parameter values: $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$.
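The full update rule can be sketched in plain NumPy (an illustrative reimplementation, not PuzzleLib's actual code). Minimizing the simple quadratic $f(\theta)=\theta^2$, whose gradient is $2\theta$, shows the iterates approaching the minimum at $\theta=0$:

```python
import numpy as np

# Illustrative NumPy sketch of the Adam update rule above (not PuzzleLib code).
def adam_step(theta, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g            # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2       # second-moment estimate
    mHat = m / (1 - beta1 ** t)                # bias correction
    vHat = v / (1 - beta2 ** t)
    theta = theta - eta * mHat / (np.sqrt(vHat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2; a larger step size is used for a quick demo.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    g = 2 * theta                              # gradient of theta^2
    theta, m, v = adam_step(theta, g, m, v, t, eta=0.1)

print(theta)  # close to 0
```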

## Initializing¶

def __init__(self, alpha=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8, nodeinfo=None):


Parameters

| Parameter | Allowed types | Description | Default |
|---|---|---|---|
| alpha | float | Step size | 1e-3 |
| beta1 | float | Exponential decay rate of the 1st moment estimate | 0.9 |
| beta2 | float | Exponential decay rate of the 2nd moment estimate | 0.999 |
| epsilon | float | Smoothing parameter | 1e-8 |
| nodeinfo | NodeInfo | Object containing information about the computational node | None |

Explanations

-

## Examples¶

Necessary imports:

import numpy as np
from PuzzleLib.Backend import gpuarray
from PuzzleLib.Optimizers import Adam  # import path assumed; adjust to your PuzzleLib version


Info

gpuarray is required to place the tensors on the GPU properly.

Let us set up a synthetic training dataset:

data = gpuarray.to_gpu(np.random.randn(16, 128).astype(np.float32))
target = gpuarray.to_gpu(np.random.randn(16, 1).astype(np.float32))


Declaring the optimizer:

optimizer = Adam()


Suppose a network net has already been defined, for example, through Graph. Then, to install the optimizer on the network, we need the following:

optimizer.setupOn(net, useGlobalState=True)


Info

You can read more about optimizer methods and their parameters in the description of the Optimizer parent class

Moreover, let there be some loss function loss, inherited from Cost, which also computes the gradient of the error. Then the optimization process is implemented as follows:

for i in range(100):
    predictions = net(data)
    error, grad = loss(predictions, target)