AdaGrad¶
Description¶
This module implements adaptive gradient descent (AdaGrad).
Regular stochastic gradient descent and its inertial variations (MomentumSGD, NesterovSGD) do not take into account the fact that some features can be extremely informative while occurring rarely (for example, in imbalanced datasets). It is worth clarifying that this concerns not only the input parameters: the same rare features can also appear in the deep representations of a convolutional network, after the input has been “digested” by several layers.
The idea of the adaptive gradient approach is as follows: each feature gets its own learning rate, chosen so that updates for frequently occurring features are reduced, while updates for rare ones are increased.
Before presenting the information in an analytical form, let us remember the formula for SGD:
$$ \theta_{t+1} = \theta_t - \eta \cdot \frac{1}{l}\sum_{i=0}^{l} \nabla_{\theta}{J_i(\theta_t)} $$
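As a reference point, the plain SGD step above can be sketched in a couple of lines of NumPy (a simplified illustration; the names `theta`, `grad` and `eta` are illustrative and not part of the library API):

```python
import numpy as np

def sgd_step(theta, grad, eta=0.01):
    # Plain SGD: one shared learning rate eta for every parameter
    return theta - eta * grad

theta = np.array([1.0, -2.0, 0.5])
grad = np.array([0.2, -0.4, 0.1])
theta = sgd_step(theta, grad)
```

Note that every parameter is scaled by the same `eta`, regardless of how often its feature occurs; this is exactly what AdaGrad changes below.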
To shorten the notation, we omit the averaging over the sum and perform the substitution:

$$ g_t = \nabla_{\theta}{J(\theta_t)} $$

Then for the $i$-th parameter $\theta$ the update looks as follows:

$$ \theta_{t+1, i} = \theta_{t, i} - \eta \cdot g_{t, i} $$
For adaptive gradient descent, the accumulated sum of squared gradients for each model parameter is introduced:

$$ G_{t, ii} = \sum_{\tau=1}^{t} g_{\tau, i}^2 $$

Here $G_t$ is a diagonal matrix in which the element at position $i,i$ is the sum of the squared gradients for the $i$-th parameter over all steps up to $t$.
Let us rewrite the update formula for the $i$-th parameter $\theta$:

$$ \theta_{t+1, i} = \theta_{t, i} - \frac{\eta}{\sqrt{G_{t, ii} + \epsilon}} \cdot g_{t, i} $$

where $\epsilon$ is a smoothing parameter needed to avoid division by zero (usually $10^{-8}$).
In vector form (using the element-wise product $\odot$):

$$ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t $$
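The update rule described above can be sketched in a few lines of NumPy (a simplified illustration, not the PuzzleLib implementation; all names are assumptions for this sketch):

```python
import numpy as np

def adagrad_step(theta, grad, G, eta=0.01, epsilon=1e-8):
    # Accumulate squared gradients (the diagonal of G_t)
    G += grad ** 2
    # Per-parameter step: eta / sqrt(G + eps), scaled by the current gradient
    theta -= eta / np.sqrt(G + epsilon) * grad
    return theta, G

theta = np.zeros(3)
G = np.zeros(3)
for _ in range(10):
    # One frequent/large feature, one rare/small feature, one silent feature
    grad = np.array([1.0, 0.1, 0.0])
    theta, G = adagrad_step(theta, grad, G)
```

A noteworthy property: with a constant gradient the parameters with large and small gradient magnitudes end up moving by (almost) the same amount, because the accumulated denominator normalizes the gradient scale away; the silent parameter is not moved at all.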
The main disadvantage of the algorithm is the continuous accumulation of squared gradients in the denominator: since every added term is positive, the sum only grows. As a result, the learning rate for some features eventually becomes so small that the algorithm can no longer continue exploring the surface of the target function. AdaDelta, RMSProp and Adam were proposed as solutions to this problem.
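The shrinking effective learning rate can be observed directly: with a constant gradient, the accumulator grows linearly and the step size decays like $\eta / \sqrt{t}$ (an illustrative NumPy sketch, not library code):

```python
import numpy as np

eta, epsilon = 0.1, 1e-8
G = 0.0
steps = []
for t in range(1, 10001):
    g = 1.0                  # constant unit gradient
    G += g ** 2              # the accumulator only ever grows
    steps.append(eta / np.sqrt(G + epsilon) * g)

# The effective step decays toward zero: roughly eta / sqrt(t)
print(steps[0], steps[99], steps[9999])
```

After 10000 iterations the step is 100 times smaller than at the start, even though the gradient itself has not changed; this is the behavior that AdaDelta, RMSProp and Adam address.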
Initializing¶
def __init__(self, learnRate=1e-3, epsilon=1e-8, nodeinfo=None):
Parameters
Parameter | Allowed types | Description | Default |
---|---|---|---|
learnRate | float | Learning rate | 1e-3 |
epsilon | float | Smoothing parameter to avoid division by zero | 1e-8 |
nodeinfo | NodeInfo | Object containing information about the computational node | None |
Explanations
-
Examples¶
Necessary imports:
import numpy as np
from PuzzleLib.Optimizers import AdaGrad
from PuzzleLib.Backend import gpuarray
Info
gpuarray is required to properly place the tensors on the GPU.
Let us set up a synthetic training dataset:
data = gpuarray.to_gpu(np.random.randn(16, 128).astype(np.float32))
target = gpuarray.to_gpu(np.random.randn(16, 1).astype(np.float32))
Declaring the optimizer:
optimizer = AdaGrad(learnRate=0.01)
Suppose there is already some network net defined, for example, through Graph. Then, to install the optimizer on the network, we need the following:
optimizer.setupOn(net, useGlobalState=True)
Info
You can read more about optimizer methods and their parameters in the description of the Optimizer parent class.
Moreover, let there be some error function loss, inherited from Cost, which also computes the gradient. Then the optimization process looks as follows:
for i in range(100):
    predictions = net(data)
    error, grad = loss(predictions, target)

    optimizer.zeroGradParams()
    net.backward(grad)
    optimizer.update()

    if (i + 1) % 5 == 0:
        print("Iteration #%d error: %s" % (i + 1, error))