Adam

Description

Info

Parent class: Optimizer

Derived classes: -

Adam (adaptive moment estimation) is an optimization algorithm that combines the momentum principle of MomentumSGD with the adaptive per-parameter updates of AdaGrad and its modifications.

The algorithm maintains exponentially decaying moving averages of the gradients of the objective function and of their squares:

\begin{equation} m_t = \beta_1{m_{t-1}} + (1 - \beta_1)g_t \end{equation}

\begin{equation} \upsilon_t = \beta_2{\upsilon_{t-1}} + (1 - \beta_2)g_t^2 \end{equation}

where

$g_t$ - gradient of the objective function at step $t$;
$m_t$ - estimate of the first moment (mean of the gradients);
$\upsilon_t$ - estimate of the second moment (uncentered variance of the gradients).

However, because $m_t$ and $\upsilon_t$ are initialized to zero, these estimates are biased toward zero during the first steps of the algorithm and take a long time to accumulate, especially when the decay rates $\beta_1$ and $\beta_2$ are close to $1$.

To remove this bias without introducing new hyperparameters, the estimates of the first and second moments are corrected:

\begin{equation} \hat{m_t} = \frac{m_t}{1 - \beta_1^t} \end{equation}

\begin{equation} \hat{\upsilon_t} = \frac{\upsilon_t}{1 - \beta_2^t} \end{equation}
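The effect of the correction can be checked numerically. The sketch below is only an illustration in plain Python, not PuzzleLib code: the helper name `moment_estimates` is hypothetical, and it assumes the moving average starts from $m_0 = 0$ and that the gradient is constant and equal to $1$, so the true first moment is $1$.

```python
def moment_estimates(grads, beta=0.9):
    """Return (raw, corrected) first-moment estimates for a gradient sequence."""
    m = 0.0  # assumed zero initialization; this is what biases early estimates
    raw, corrected = [], []
    for t, g in enumerate(grads, start=1):
        m = beta * m + (1.0 - beta) * g           # m_t
        raw.append(m)
        corrected.append(m / (1.0 - beta ** t))   # m_t / (1 - beta^t)
    return raw, corrected


# Constant gradient g_t = 1 for five steps.
raw, corrected = moment_estimates([1.0] * 5, beta=0.9)
for t, (m, m_hat) in enumerate(zip(raw, corrected), start=1):
    print(f"t={t}  m_t={m:.4f}  m_hat_t={m_hat:.4f}")
```

For this input the uncorrected estimate equals $1 - \beta_1^t$, so it starts at $0.1$ and approaches $1$ only slowly, while the corrected estimate matches the true value from the first step.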

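Then the parameter update rule is:

\begin{equation} \theta_{t + 1} = \theta_t - \frac {\eta}{\sqrt{\hat{\upsilon_t}} + \epsilon} \hat{m_t} \end{equation}

where $\eta$ is the step size (the alpha parameter of the constructor) and $\epsilon$ is a small constant that prevents division by zero.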

The authors of the algorithm recommend the following parameter values: $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$.
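Put together, the formulas above describe a single update step. The following is a minimal sketch for one scalar parameter, written in plain Python and independent of PuzzleLib; the function `adam_step` and the toy objective $f(\theta) = \theta^2$ are assumptions made for illustration and do not reflect how the library implements the optimizer internally.

```python
import math


def adam_step(theta, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)           # bias-corrected second moment
    theta = theta - eta / (math.sqrt(v_hat) + eps) * m_hat
    return theta, m, v


# Toy problem: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):  # t starts at 1 so the bias correction is defined
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, eta=0.01)
print(theta)
```

Note that on the very first step the corrected ratio $\hat{m_t} / \sqrt{\hat{\upsilon_t}}$ has magnitude close to $1$, so the parameter moves by roughly $\eta$ regardless of the gradient scale.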

Initializing

def __init__(self, alpha=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8, nodeinfo=None):

Parameters

Parameter	Allowed types	Description	Default
alpha	float	Step size	1e-3
beta1	float	Exponential decay rate of the first moment estimate	0.9
beta2	float	Exponential decay rate of the second moment estimate	0.999
epsilon	float	Smoothing parameter (small constant added to the denominator)	1e-8
nodeinfo	NodeInfo	Object containing information about the computational node	None

Explanations

-

Examples

Necessary imports:

import numpy as np
from PuzzleLib.Optimizers import Adam
from PuzzleLib.Backend import gpuarray

Info

gpuarray is required to place tensors on the GPU.

Let us set up a synthetic training dataset:

data = gpuarray.to_gpu(np.random.randn(16, 128).astype(np.float32))
target = gpuarray.to_gpu(np.random.randn(16, 1).astype(np.float32))

Declaring the optimizer:

optimizer = Adam()

Suppose a network net has already been defined, for example through Graph. To attach the optimizer to the network, call:

optimizer.setupOn(net, useGlobalState=True)

Info

You can read more about optimizer methods and their parameters in the description of the Optimizer parent class.

Suppose also that there is a loss function loss, inherited from Cost, that returns both the error and its gradient. The optimization loop then looks like this:

for i in range(100):
    # Forward pass and loss evaluation
    predictions = net(data)
    error, grad = loss(predictions, target)

    # Reset accumulated gradients, backpropagate, then apply the Adam step
    optimizer.zeroGradParams()
    net.backward(grad)
    optimizer.update()

    # Report the error every 5 iterations
    if (i + 1) % 5 == 0:
        print("Iteration #%d error: %s" % (i + 1, error))