Adam¶
Description¶
Adam (adaptive moment estimation) is an optimization algorithm that combines the inertia principle of MomentumSGD with the adaptive per-parameter updates of AdaGrad and its modifications.
The algorithm maintains exponentially decaying moving averages of the gradients of the objective function and of their squares:
\begin{equation} m_t = \beta_1{m_{t-1}} + (1 - \beta_1)g_t \end{equation}
\begin{equation} \upsilon_t = \beta_2{\upsilon_{t-1}} + (1 - \beta_2)g_t^2 \end{equation}
where
m_t is the estimate of the first moment (the mean of the gradients);
\upsilon_t is the estimate of the second moment (the uncentered variance of the gradients).
However, since m_t and \upsilon_t are initialized at zero, these estimates are biased toward zero during the first steps of the algorithm, especially when the decay rates \beta_1 and \beta_2 are close to 1 (for example, at t=1 we have m_1 = (1 - \beta_1)g_1, which for \beta_1 = 0.9 underestimates the gradient by a factor of ten).
To correct this bias without introducing new hyperparameters, the estimates of the first and second moments are slightly modified:
\begin{equation} \hat{m_t} = \frac{m_t}{1 - \beta_1^t} \end{equation}
\begin{equation} \hat{\upsilon_t} = \frac{\upsilon_t}{1 - \beta_2^t} \end{equation}
Then the expression for the parameter update is:
\begin{equation} w_t = w_{t-1} - \frac{\alpha\hat{m_t}}{\sqrt{\hat{\upsilon_t}} + \epsilon} \end{equation}
where \alpha is the step size and \epsilon is a small smoothing constant that prevents division by zero.
The authors of the algorithm recommend using the following parameter values: \beta_1=0.9, \beta_2=0.999, \epsilon=1e-8.
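For illustration, below is a minimal NumPy sketch of the update rule described above, applied to a toy quadratic objective. This is not the PuzzleLib implementation; the function adam_step and all variable names are chosen only for this example:

import numpy as np


def adam_step(w, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # Exponentially decaying moving averages of the gradient and of its square
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2

    # Bias-corrected estimates of the first and second moments
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Parameter update
    w = w - alpha * m_hat / (np.sqrt(v_hat) + epsilon)
    return w, m, v


# Toy objective f(w) = ||w||^2 / 2, whose gradient is simply w
w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)

for t in range(1, 5001):
    w, m, v = adam_step(w, w, m, v, t)

print(w)  # after enough steps the parameters end up close to the minimum at [0, 0]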
Initializing¶
def __init__(self, alpha=1e-3, beta1=0.9, beta2=0.999, epsilon=1e-8, nodeinfo=None):
Parameters
Parameter | Allowed types | Description | Default |
---|---|---|---|
alpha | float | Step size | 1e-3 |
beta1 | float | Exponential decay rate for the 1st moment estimates | 0.9 |
beta2 | float | Exponential decay rate for the 2nd moment estimates | 0.999 |
epsilon | float | Smoothing parameter | 1e-8 |
nodeinfo | NodeInfo | Object containing information about the computational node | None |
Explanations
-
Examples¶
Necessary imports:
import numpy as np
from PuzzleLib.Optimizers import Adam
from PuzzleLib.Backend import gpuarray
Info
gpuarray is required to properly place tensors on the GPU.
Let us set up a synthetic training dataset:
data = gpuarray.to_gpu(np.random.randn(16, 128).astype(np.float32))
target = gpuarray.to_gpu(np.random.randn(16, 1).astype(np.float32))
Declaring the optimizer:
optimizer = Adam()
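The hyperparameters listed above can also be passed explicitly; for example, to use a smaller step size (the value 5e-4 here is purely illustrative):
optimizer = Adam(alpha=5e-4)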
Suppose there is already some network net defined, for example, through Graph; then, in order to install the optimizer on the network, the following is needed:
optimizer.setupOn(net, useGlobalState=True)
Info
You can read more about optimizer methods and their parameters in the description of the Optimizer parent class.
Moreover, suppose there is some error function loss, inherited from Cost, which also computes its gradient. Then the optimization process can be implemented as follows:
for i in range(100):
    predictions = net(data)
    error, grad = loss(predictions, target)

    optimizer.zeroGradParams()
    net.backward(grad)
    optimizer.update()

    if (i + 1) % 5 == 0:
        print("Iteration #%d error: %s" % (i + 1, error))
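After training, the network can be run on fresh data in the same way. The snippet below assumes that PuzzleLib GPU tensors expose a get method for copying data back to a NumPy array (an assumption for this sketch; check the Backend documentation for the exact API):

test = gpuarray.to_gpu(np.random.randn(4, 128).astype(np.float32))
print(net(test).get())  # get() is assumed here to copy the GPU tensor back to host memory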