Parent class: Optimizer

Derived classes: -

AdaDelta (root mean square propagation) is an optimization algorithm that inherits the ideas laid in AdaGrad.

The main drawback of pure adaptive gradient descent is the uncontrolled accumulation of squared gradients, which leads to a constant decrease in the learning rate and, as a result, to a paralysis of the learning process itself.

The first principle of AdaDelta is that instead of the full amount of Gt updates, the gradient square averaged over history will be used. The method resembles the principle used in MomentumSGD - exponential moving average method.

Let us introduce a notation E[g^2]_t - moving average of the square of the gradient at the t time.

E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma)g_t^2

Then, when we insert E[g^2]_t in the formula for updating parameters for AdaGrad instead or G_t, we will get the following formula (matrix operations are omitted for simplicity):

\theta_{t + 1} = \theta_t - \frac {\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t

The denominator is the root mean square (RMS) of the gradients :

RMS[g]_t = \sqrt{E[g^2]_t + \epsilon}

Then the expression for the parameter update value is:

\Delta\theta_t = - \frac {\eta}{RMS[g]_t} g_t

Let us note that since the learning rate is nondimensional, the abstract units of measurement of this quantity and parameters do not coincide (however, the same applies to many other stochastic gradient descent algorithms). The next idea when developing AdaDelta was to bring the update value to the same units of measurement that the model parameters have.

This can be done by removing the learning rate from the formula. Let us define for this another moving average - the average of the squares of the values of the update parameters:

E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1 - \gamma)\Delta\theta_t^2


RMS[\Delta\theta]_t = \sqrt{E[\Delta\theta^2]_t + \epsilon}

When we insert RMS[\Delta\theta]_t in the formula of updating the parameters, we get the expression for the AdaDelta algorithm:

\begin{equation} \Delta\theta_t = -\frac {RMS[\Delta\theta]_{t-1}} {RMS[g]_t} g_t \end{equation}

\begin{equation} \theta_{t + 1} = \theta_t + \Delta\theta_t \end{equation}

It is recommended to set the \gamma parameter to 0.9.


def __init__(self, rho=0.95, epsilon=1e-6, nodeinfo=None):


Parameter Allowed types Description Default
rho float Decay rate 0.95
epsilon float Smoothing parameter 1e-6
nodeinfo NodeInfo Object containing information about the computational node None




Necessary imports:

>>> import numpy as np
>>> from PuzzleLib.Optimizers import AdaDelta
>>> from PuzzleLib.Backend import gpuarray


gpuarray is required to properly place the tensor in the GPU.

Let us set up a synthetic training dataset:

>>> data = gpuarray.to_gpu(np.random.randn(16, 128).astype(np.float32))
>>> target = gpuarray.to_gpu(np.random.randn(16, 1).astype(np.float32))

Declaring the optimizer:

>>> optimizer = AdaDelta()

Suppose that there is already some net network defined, for example, through Graph, then in order to install the optimizer on the network, the following is required:

>>> optimizer.setupOn(net, useGlobalState=True)


You can read more about optimizer methods and their parameters in the description of the Optimizer parent class

Moreover, let there be some loss error function, inherited from Cost, calculating its gradient as well. Then we get the implementation of the optimization process:

>>> for i in range(100):
...   predictions = net(data)
...   error, grad = loss(predictions, target)

...   optimizer.zeroGradParams()
...   net.backward(grad)
...   optimizer.update()

...   if (i + 1) % 5 == 0:
...     print("Iteration #%d error: %s" % (i + 1, error))