Parent class: Optimizer
Derived classes: -
RMSProp (root mean square propagation) is an optimization algorithm developed independently of, and concurrently with, AdaDelta; it is effectively a component of the latter. Both algorithms were created to solve the main problem of AdaGrad: the uncontrolled accumulation of squared gradients, which ultimately paralyzes the learning process.
The idea of RMSProp is as follows: instead of the full accumulated sum G_t of squared gradients, a history-averaged squared gradient is used. The method resembles the principle used in MomentumSGD: an exponentially decaying moving average.
Let us introduce the notation E[g^2]_t for the running average of the squared gradient at step t. The formula for calculating it is as follows:

E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2

where \gamma is the exponential decay rate.
Then, if we insert E[g^2]_t into the AdaGrad parameter update formula instead of G_t, the result will be the following (matrix operations are omitted to keep the formula simple):

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t
The denominator is the root mean square (RMS) of the gradients:

RMS[g]_t = \sqrt{E[g^2]_t + \epsilon}

so the update can be written as \theta_{t+1} = \theta_t - \frac{\eta}{RMS[g]_t} g_t.
The recommended values are \eta = 0.001 and \gamma = 0.9.
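The update rule above can be sketched in plain NumPy (a minimal illustration of the formulas, not PuzzleLib's actual implementation; the function and variable names here are ours):

```python
import numpy as np

def rmsprop_step(param, grad, avg_sq, learnRate=0.001, factor=0.9, epsilon=1e-5):
    # E[g^2]_t = gamma * E[g^2]_{t-1} + (1 - gamma) * g_t^2
    avg_sq = factor * avg_sq + (1.0 - factor) * grad ** 2
    # theta_{t+1} = theta_t - eta / sqrt(E[g^2]_t + eps) * g_t
    param = param - learnRate * grad / np.sqrt(avg_sq + epsilon)
    return param, avg_sq

# Toy example: minimize f(x) = x^2, whose gradient is 2x, starting from x = 1
x, avg = 1.0, 0.0
for _ in range(2000):
    x, avg = rmsprop_step(x, 2.0 * x, avg)
```

Note how the effective step size stays close to `learnRate` once `avg_sq` tracks the squared gradient: the division by the RMS normalizes the gradient magnitude, which is what prevents the AdaGrad-style decay of the learning rate.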
def __init__(self, learnRate=1e-3, factor=0.1, epsilon=1e-5, nodeinfo=None):
|Parameter|Type|Description|Default|
|-|-|-|-|
|learnRate|float|Learning rate|1e-3|
|factor|float|Exponential decay rate|0.1|
|epsilon|float|Smoothing constant preventing division by zero|1e-5|
|nodeinfo|NodeInfo|Object containing information about the computational node|None|
>>> import numpy as np
>>> from PuzzleLib.Optimizers import RMSProp
>>> from PuzzleLib.Backend import gpuarray
gpuarray is required to properly place the tensors on the GPU.
Let us set up a synthetic training dataset:
>>> data = gpuarray.to_gpu(np.random.randn(16, 128).astype(np.float32))
>>> target = gpuarray.to_gpu(np.random.randn(16, 1).astype(np.float32))
Declaring the optimizer:
>>> optimizer = RMSProp(learnRate=0.001, factor=0.9)
Suppose there is already some network net defined, for example, through Graph. To install the optimizer on the network:
>>> optimizer.setupOn(net, useGlobalState=True)
You can read more about optimizer methods and their parameters in the description of the Optimizer parent class.
Moreover, suppose there is some loss error function, inherited from Cost, that also computes the gradient. Then the optimization process is implemented as follows:
>>> for i in range(100):
...     predictions = net(data)
...     error, grad = loss(predictions, target)
...     optimizer.zeroGradParams()
...     net.backward(grad)
...     optimizer.update()
...     if (i + 1) % 5 == 0:
...         print("Iteration #%d error: %s" % (i + 1, error))