Parent class: Optimizer

Derived classes: -

RMSProp (root mean square propagation) is an optimization algorithm developed independently of AdaDelta at roughly the same time; it also appears as a component of AdaDelta. Both algorithms were created to solve the main problem of AdaGrad: the uncontrolled accumulation of squared gradients, which eventually stalls the learning process.

The idea of RMSProp is as follows: instead of the full sum of squared gradients G_t, a history-averaged squared gradient is used. The method resembles the principle used in MomentumSGD: an exponentially decaying moving average.

Let us denote by E[g^2]_t the running average of the squared gradient at time step t. It is computed as follows:

E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma)g_t^2
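To see how this running average behaves, here is a small NumPy illustration (not part of the library): with a constant gradient g, E[g^2]_t starts at zero and converges toward g^2, with gamma controlling how quickly old history is forgotten.

```python
import numpy as np

gamma = 0.9
Eg2 = 0.0  # running average E[g^2], initialized at zero

# with a constant gradient g, E[g^2]_t converges toward g^2 = 4.0
g = 2.0
history = []
for t in range(50):
    Eg2 = gamma * Eg2 + (1.0 - gamma) * g ** 2
    history.append(Eg2)

print(history[0], history[-1])  # starts near 0.4, approaches 4.0
```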

Then, if we substitute E[g^2]_t for G_t in the AdaGrad parameter update formula, the result is the following (matrix operations are omitted to keep the formula simple):

\theta_{t + 1} = \theta_t - \frac {\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t

The denominator is the root mean square (RMS) of the gradients:

RMS[g]_t = \sqrt{E[g^2]_t + \epsilon}

The recommended values are \eta = 0.001 and \gamma = 0.9.
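The two formulas above can be sketched in plain NumPy. This is a minimal illustration of the math, not the library's actual implementation; the function name rmsprop_step and the toy objective f(x) = x^2 are assumptions for the example.

```python
import numpy as np

def rmsprop_step(theta, grad, Eg2, lr=1e-3, gamma=0.9, eps=1e-5):
    # update the running average of squared gradients
    Eg2 = gamma * Eg2 + (1.0 - gamma) * grad ** 2
    # scale the step by the root mean square of past gradients
    theta = theta - lr / np.sqrt(Eg2 + eps) * grad
    return theta, Eg2

# minimize f(x) = x^2 starting from x = 1.0
theta, Eg2 = np.array([1.0]), np.zeros(1)
for _ in range(2000):
    grad = 2.0 * theta
    theta, Eg2 = rmsprop_step(theta, grad, Eg2, lr=1e-2)

print(theta)  # close to 0
```

Note that because the step size is normalized by RMS[g]_t, the effective step magnitude saturates near lr, so the iterate oscillates around the minimum with an amplitude on the order of the learning rate.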


def __init__(self, learnRate=1e-3, factor=0.1, epsilon=1e-5, nodeinfo=None):


Parameter | Allowed types | Description                                              | Default
learnRate | float         | Learning rate                                            | 1e-3
factor    | float         | Exponential decay rate                                   | 0.1
epsilon   | float         | Smoothing parameter                                      | 1e-5
nodeinfo  | NodeInfo      | Object containing information about the computational node | None




Necessary imports:

import numpy as np
from PuzzleLib.Optimizers import RMSProp
from PuzzleLib.Backend import gpuarray


gpuarray is required to properly place tensors on the GPU.

Let us set up a synthetic training dataset:

data = gpuarray.to_gpu(np.random.randn(16, 128).astype(np.float32))
target = gpuarray.to_gpu(np.random.randn(16, 1).astype(np.float32))

Declaring the optimizer:

optimizer = RMSProp(learnRate=0.001, factor=0.9)

Suppose a network net has already been defined, for example, through Graph. To attach the optimizer to the network, we need the following:

optimizer.setupOn(net, useGlobalState=True)


You can read more about optimizer methods and their parameters in the description of the Optimizer parent class.

Moreover, suppose there is some loss function inherited from Cost that also computes the error gradient. The optimization loop then looks as follows:

for i in range(100):
	predictions = net(data)
	error, grad = loss(predictions, target)

	optimizer.zeroGradParams()
	net.backward(grad)
	optimizer.update()

	if (i + 1) % 5 == 0:
		print("Iteration #%d error: %s" % (i + 1, error))