Parent class: Optimizer

Derived classes: -

RMSProp (root mean square propagation) is an optimization algorithm that was developed in parallel with AdaDelta and is also one of its components. Both algorithms were created to solve the main problem of AdaGrad: the unbounded accumulation of squared gradients, which eventually shrinks the effective learning rate so far that learning stalls.

The idea of RMSProp is as follows: instead of the full sum of squared gradients G_t, an exponentially decaying average of the squared gradient is used. The approach resembles the exponentially decaying moving average used in MomentumSGD.

Let us introduce the notation E[g^2]_t for the running average of the squared gradient at time t. It is calculated as follows:

E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma)g_t^2
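
To make the difference from AdaGrad concrete, here is a minimal NumPy sketch that compares the two accumulators on a constant gradient. The function names adagrad_accumulator and running_average are hypothetical and are not part of PuzzleLib; the starting value E[g^2]_0 = 0 is also an assumption.

```python
import numpy as np

def adagrad_accumulator(grads):
    """Cumulative sum of squared gradients (AdaGrad's G_t)."""
    return np.cumsum(np.square(grads))

def running_average(grads, gamma=0.9):
    """Exponentially decaying average E[g^2]_t, assuming E[g^2]_0 = 0."""
    avg = 0.0
    history = []
    for g in grads:
        avg = gamma * avg + (1.0 - gamma) * g ** 2
        history.append(avg)
    return np.array(history)

grads = np.ones(200)            # a constant gradient of 1 for 200 steps
G = adagrad_accumulator(grads)  # keeps growing: 1, 2, 3, ...
E = running_average(grads)      # levels off near g^2 = 1

print("AdaGrad accumulator after 200 steps:", G[-1])
print("Running average after 200 steps:", E[-1])
```

The AdaGrad accumulator grows linearly with the number of steps, so the step size divided by its square root keeps shrinking, while the running average stays bounded by the recent squared gradients.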

Substituting E[g^2]_t for G_t in the AdaGrad parameter update formula gives the following (matrix operations are omitted to keep the formula simple):

\theta_{t + 1} = \theta_t - \frac {\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t

The denominator is the root mean square (RMS) of the gradients:

RMS[g]_t = \sqrt{E[g^2]_t + \epsilon}

The recommended values are \eta = 0.001 and \gamma = 0.9.
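
The update rule can be sketched in plain NumPy as follows. This is an illustration of the formulas above, not PuzzleLib's implementation: rmsprop_step is a hypothetical helper, the test function f(theta) = sum(theta^2) is chosen only for the example, and eps = 1e-5 is borrowed from the constructor default on the assumption that it plays the role of \epsilon.

```python
import numpy as np

def rmsprop_step(theta, grad, avg_sq, eta=0.001, gamma=0.9, eps=1e-5):
    """One RMSProp update; returns the new parameters and running average."""
    # E[g^2]_t = gamma * E[g^2]_{t-1} + (1 - gamma) * g_t^2
    avg_sq = gamma * avg_sq + (1.0 - gamma) * grad ** 2
    # theta_{t+1} = theta_t - eta / sqrt(E[g^2]_t + eps) * g_t
    theta = theta - eta / np.sqrt(avg_sq + eps) * grad
    return theta, avg_sq

theta = np.array([1.0, -2.0])
avg_sq = np.zeros_like(theta)   # assumed initial value E[g^2]_0 = 0

for _ in range(5000):
    grad = 2.0 * theta          # gradient of f(theta) = sum(theta^2)
    theta, avg_sq = rmsprop_step(theta, grad, avg_sq)

print("theta after 5000 steps:", theta)
```

Because the gradient is divided by its own RMS, each coordinate moves by roughly \eta per step regardless of the gradient's scale, so both coordinates approach zero at a similar pace even though their gradients differ by a factor of two.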


def __init__(self, learnRate=1e-3, factor=0.1, epsilon=1e-5, nodeinfo=None):


| Parameter | Allowed types | Description | Default |
|-----------|---------------|-------------|---------|
| learnRate | float | Learning rate | 1e-3 |
| factor | float | Exponential decay rate | 0.1 |
| epsilon | float | Smoothing parameter | 1e-5 |
| nodeinfo | NodeInfo | Object containing information about the computational node | None |




Necessary imports:

import numpy as np
from PuzzleLib.Optimizers import RMSProp
from PuzzleLib.Backend import gpuarray


gpuarray is needed to place the tensors on the GPU.

Let us set up a synthetic training dataset:

data = gpuarray.to_gpu(np.random.randn(16, 128).astype(np.float32))
target = gpuarray.to_gpu(np.random.randn(16, 1).astype(np.float32))

Declaring the optimizer:

optimizer = RMSProp(learnRate=0.001, factor=0.9)

Suppose a network net has already been defined, for example through Graph. To install the optimizer on the network, call:

optimizer.setupOn(net, useGlobalState=True)


You can read more about the optimizer methods and their parameters in the description of the Optimizer parent class.

Suppose also that there is a loss function loss, inherited from Cost, that returns both the error and its gradient. The optimization process can then be implemented as follows:

for i in range(100):
    predictions = net(data)
    error, grad = loss(predictions, target)

    optimizer.zeroGradParams()
    net.backward(grad)
    optimizer.update()

    if (i + 1) % 5 == 0:
        print("Iteration #%d error: %s" % (i + 1, error))
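
For readers without a GPU, the same loop can be approximated with NumPy alone. The sketch below is a library-free analog and does not use or reproduce PuzzleLib's internals: the single linear layer, the mean squared error, and the names weights, avg_sq and errors are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((16, 128)).astype(np.float32)
target = rng.standard_normal((16, 1)).astype(np.float32)

weights = np.zeros((128, 1), dtype=np.float32)  # stand-in for the network
avg_sq = np.zeros_like(weights)                 # running average E[g^2]
eta, gamma, eps = 0.001, 0.9, 1e-5

errors = []
for i in range(100):
    # Forward pass and mean squared error
    predictions = data @ weights
    diff = predictions - target
    error = float(np.mean(diff ** 2))
    errors.append(error)

    # Backward pass: gradient of the error with respect to the weights
    grad = 2.0 * data.T @ diff / len(data)

    # RMSProp update
    avg_sq = gamma * avg_sq + (1.0 - gamma) * grad ** 2
    weights -= eta / np.sqrt(avg_sq + eps) * grad

    if (i + 1) % 5 == 0:
        print("Iteration #%d error: %s" % (i + 1, error))
```

The forward pass, the gradient computation and the update follow the same order as net(data), net.backward(grad) and optimizer.update() in the loop above.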