Parent class: Optimizer

Derived classes: -

Adam (adaptive moment estimation) is an optimization algorithm that combines the momentum (inertia) principle of MomentumSGD with the adaptive per-parameter updates of AdaGrad and its modifications.

The algorithm maintains exponentially decaying moving averages of the gradients of the objective function and of their squares:

\begin{equation} m_t = \beta_1{m_{t-1}} + (1 - \beta_1)g_t \end{equation}

\begin{equation} \upsilon_t = \beta_2{\upsilon_{t-1}} + (1 - \beta_2)g_t^2 \end{equation}


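g_t - gradient of the objective function at step t;
m_t - estimate of the first moment (mean of the gradients);
\upsilon_t - estimate of the second moment (uncentered variance of the gradients);
\beta_1, \beta_2 - exponential decay rates of the two estimates.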
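The two moving averages can be sketched in a few lines of plain Python. This is only an illustration for a single scalar parameter, not PuzzleLib's implementation; the function name `update_moments` is hypothetical.

```python
def update_moments(m, v, grad, beta1=0.9, beta2=0.999):
    """One step of the exponential moving averages for a scalar gradient."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment: average of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment: average of squared gradients
    return m, v


# Both estimates start at zero and are fed a constant gradient of 1.0.
m, v = 0.0, 0.0
for g in [1.0, 1.0, 1.0]:
    m, v = update_moments(m, v, g)

print(m, v)
```

Even though every gradient equals 1, both estimates are still far below 1 after three steps, and the second one grows much more slowly because \beta_2 is closer to 1.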

However, because m_t and \upsilon_t are initialized to zero, these estimates are biased toward zero during the first steps of the algorithm, especially when the decay rates \beta_1 and \beta_2 are close to 1.

To remove this bias without introducing new hyperparameters, the estimates of the first and second moments are corrected:

\begin{equation} \hat{m_t} = \frac{m_t}{1 - \beta_1^t} \end{equation}

\begin{equation} \hat{\upsilon_t} = \frac{\upsilon_t}{1 - \beta_2^t} \end{equation}
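The effect of the correction is easy to see on a constant gradient. The sketch below (plain Python, scalar case, with the hypothetical helper `bias_corrected`) records the raw first moment next to the corrected estimates for the first five steps.

```python
def bias_corrected(m, v, t, beta1=0.9, beta2=0.999):
    """Divide the raw moments by (1 - beta^t); t is the 1-based step number."""
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return m_hat, v_hat


beta1, beta2 = 0.9, 0.999
m, v = 0.0, 0.0
history = []  # tuples of (raw m, corrected m, corrected v)

for t in range(1, 6):
    g = 1.0  # constant gradient, so the true mean and mean square are both 1
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat, v_hat = bias_corrected(m, v, t, beta1, beta2)
    history.append((m, m_hat, v_hat))

for raw_m, m_hat, v_hat in history:
    print(raw_m, m_hat, v_hat)
```

The raw first moment starts near 0.1 and climbs slowly, while the corrected estimates stay at 1 (up to floating-point rounding) from the very first step.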

Then the expression for parameter updating is:

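\begin{equation} \theta_{t + 1} = \theta_t - \frac {\eta}{\sqrt{\hat{\upsilon_t}} + \epsilon} \hat{m_t} \end{equation}

\eta - step size (the alpha parameter of the constructor);
\epsilon - small constant that prevents division by zero.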

The authors of the algorithm recommend using the following parameter values: \beta_1=0.9, \beta_2=0.999, \epsilon=1e-8.
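Putting the formulas together, a complete update step can be sketched as follows. This is a scalar, framework-free illustration under stated assumptions, not the PuzzleLib code: the function `adam_step`, the objective f(\theta) = (\theta - 3)^2 and the step size 0.05 are all chosen only for the demonstration.

```python
import math


def adam_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter; t is the 1-based step number."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)   # bias-corrected second moment
    theta = theta - eta / (math.sqrt(v_hat) + eps) * m_hat
    return theta, m, v


# On the first step the corrected ratio m_hat / sqrt(v_hat) has magnitude ~1,
# so the parameter moves by roughly eta regardless of the gradient scale.
first_theta, _, _ = adam_step(0.0, -6.0, 0.0, 0.0, t=1, eta=0.05)

# Minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t, eta=0.05)

print(first_theta, theta)
```

The first step has a size close to \eta even though the gradient is -6, which illustrates how the normalization by \sqrt{\hat{\upsilon_t}} makes the step size largely independent of the gradient magnitude.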


def __init__(self, alpha=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8, nodeinfo=None):


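| Parameter | Allowed types | Description | Default |
|-----------|---------------|-------------|---------|
| alpha | float | Step size (\eta in the update rule) | 1e-3 |
| beta1 | float | Exponential decay rate of the first-moment estimate | 0.9 |
| beta2 | float | Exponential decay rate of the second-moment estimate | 0.999 |
| epsilon | float | Smoothing parameter that prevents division by zero | 1e-8 |
| nodeinfo | NodeInfo | Object containing information about the computational node | None |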




Necessary imports:

import numpy as np
from PuzzleLib.Optimizers import Adam
from PuzzleLib.Backend import gpuarray


gpuarray is needed to place tensors on the GPU.

Let us set up a synthetic training dataset:

data = gpuarray.to_gpu(np.random.randn(16, 128).astype(np.float32))
target = gpuarray.to_gpu(np.random.randn(16, 1).astype(np.float32))

Declaring the optimizer:

optimizer = Adam()

Suppose a network net has already been defined, for example through Graph. To attach the optimizer to the network, call:

optimizer.setupOn(net, useGlobalState=True)


You can read more about optimizer methods and their parameters in the description of the parent class Optimizer.

Suppose also that there is a loss function loss, inherited from Cost, that returns both the error and its gradient. The optimization loop then looks like this:

for i in range(100):
    predictions = net(data)
    error, grad = loss(predictions, target)

    optimizer.zeroGradParams()
    net.backward(grad)
    optimizer.update()

    if (i + 1) % 5 == 0:
        print("Iteration #%d error: %s" % (i + 1, error))