AdaDelta¶
Description¶
AdaDelta is an optimization algorithm that builds on the ideas introduced in AdaGrad.
The main drawback of pure adaptive gradient descent is the uncontrolled accumulation of squared gradients, which leads to a steadily shrinking learning rate and, as a result, to a stall of the learning process itself.
The first idea behind AdaDelta is to replace the full accumulated sum of squared gradients G_t with a squared gradient averaged over history. The technique resembles the one used in MomentumSGD: an exponential moving average.
Let us introduce the notation E[g^2]_t for the moving average of the squared gradient at time t:
\begin{equation} E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g^2_t \end{equation}
Then, substituting E[g^2]_t for G_t in the AdaGrad parameter update formula, we get the following expression (matrix operations are omitted for simplicity):
\begin{equation} \Delta\theta_t = -\frac {\eta} {\sqrt{E[g^2]_t + \epsilon}} g_t \end{equation}
The denominator is the root mean square (RMS) of the gradients:
\begin{equation} RMS[g]_t = \sqrt{E[g^2]_t + \epsilon} \end{equation}
Then the expression for the parameter update value is:
\begin{equation} \Delta\theta_t = -\frac {\eta} {RMS[g]_t} g_t \end{equation}
Note that since the learning rate is dimensionless, the abstract units of measurement of the update and of the parameters do not coincide (the same, however, applies to many other stochastic gradient descent algorithms). The next idea in the development of AdaDelta was to bring the update value to the same units of measurement as the model parameters.
This can be done by removing the learning rate from the formula. To this end, let us define another moving average, this time of the squared parameter updates:
\begin{equation} E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1 - \gamma) \Delta\theta^2_t \end{equation}
Then:
\begin{equation} RMS[\Delta\theta]_t = \sqrt{E[\Delta\theta^2]_t + \epsilon} \end{equation}
Since RMS[\Delta\theta]_t is not yet known at step t, we substitute RMS[\Delta\theta]_{t-1} for the learning rate in the update formula, which gives the AdaDelta update rule:
\begin{equation} \Delta\theta_t = -\frac {RMS[\Delta\theta]_{t-1}} {RMS[g]_t} g_t \end{equation}
\begin{equation} \theta_{t + 1} = \theta_t + \Delta\theta_t \end{equation}
It is recommended to set the \gamma parameter to 0.9.
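To make the update rule concrete, here is a minimal NumPy sketch of a single AdaDelta step (illustrative only, not the PuzzleLib implementation; the function and state names are made up for this example):
import numpy as np

def adadelta_step(theta, grad, state, gamma=0.9, epsilon=1e-6):
    # E[g^2]_t: moving average of squared gradients
    state["Eg2"] = gamma * state["Eg2"] + (1 - gamma) * grad ** 2

    # Delta theta_t = -RMS[Delta theta]_{t-1} / RMS[g]_t * g_t
    delta = -np.sqrt(state["Edx2"] + epsilon) / np.sqrt(state["Eg2"] + epsilon) * grad

    # E[Delta theta^2]_t: moving average of squared updates
    state["Edx2"] = gamma * state["Edx2"] + (1 - gamma) * delta ** 2

    return theta + delta, state

theta = np.array([1.0, -2.0])
state = {"Eg2": np.zeros_like(theta), "Edx2": np.zeros_like(theta)}

for _ in range(200):
    grad = theta  # gradient of the toy objective 0.5 * ||theta||^2
    theta, state = adadelta_step(theta, grad, state)
Note that no learning rate appears anywhere in the step: the ratio of the two RMS terms plays its role.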
Initializing¶
def __init__(self, rho=0.95, epsilon=1e-6, nodeinfo=None):
Parameters
Parameter | Allowed types | Description | Default |
---|---|---|---|
rho | float | Decay rate | 0.95 |
epsilon | float | Smoothing parameter | 1e-6 |
nodeinfo | NodeInfo | Object containing information about the computational node | None |
Explanations
-
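For example, the decay rate and the smoothing parameter can be set explicitly when constructing the optimizer (a sketch based on the signature above; rho plays the role of \gamma and epsilon of \epsilon in the formulas):
from PuzzleLib.Optimizers import AdaDelta

# rho is the decay rate of the moving averages, epsilon the smoothing term under the square roots
optimizer = AdaDelta(rho=0.9, epsilon=1e-6)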
Examples¶
Necessary imports:
import numpy as np
from PuzzleLib.Optimizers import AdaDelta
from PuzzleLib.Backend import gpuarray
Info

gpuarray is required to properly place the tensor on the GPU.
Let us set up a synthetic training dataset:
data = gpuarray.to_gpu(np.random.randn(16, 128).astype(np.float32))
target = gpuarray.to_gpu(np.random.randn(16, 1).astype(np.float32))
Declaring the optimizer:
optimizer = AdaDelta()
Suppose that there is already some net network defined, for example, through Graph; then, in order to install the optimizer on the network, the following is required:
optimizer.setupOn(net, useGlobalState=True)
Info
You can read more about optimizer methods and their parameters in the description of the Optimizer parent class.
Moreover, let there be some loss error function, inherited from Cost, which also computes its gradient. Then the optimization process can be implemented as follows:
for i in range(100):
    predictions = net(data)
    error, grad = loss(predictions, target)

    optimizer.zeroGradParams()
    net.backward(grad)
    optimizer.update()

    if (i + 1) % 5 == 0:
        print("Iteration #%d error: %s" % (i + 1, error))