NesterovSGD¶
Description¶
This module implements the modification of stochastic gradient descent proposed by Yu. E. Nesterov.
This method is an improved version of the inertial method, in other words momentum stochastic gradient descent. Nesterov's modification adds a well-known idea from computational mathematics to the inertial method: looking ahead along the update vector.
Let us recall the formula for the inertial method (for parameter explanations, please see the documentation for MomentumSGD):
\begin{equation} g_t = \gamma{g_{t-1}} + (1 - \gamma)\eta\nabla_\theta J(\theta_t) \end{equation} \begin{equation} \theta_{t+1} = \theta_t - g_t \end{equation}
The idea is that since we are still going to shift by \gamma{g_{t-1}}, it makes sense to calculate the gradient of the loss function not at \theta_t, but at \theta_t - \gamma{g_{t-1}}:
\begin{equation} g_t = \gamma{g_{t-1}} + (1 - \gamma)\eta\nabla_\theta J(\theta_t - \gamma{g_{t-1}}) \end{equation} \begin{equation} \theta_{t+1} = \theta_t - g_t \end{equation}
This change allows us to "move" faster when the derivative increases in the direction we are heading, and slower when it decreases.
"Looking ahead" can play a trick on us if \gamma and \eta are too large: we look so far ahead that we skip over regions where the gradient has the opposite sign.
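To make the update rule concrete, here is a minimal standalone NumPy sketch of Nesterov's update on a toy quadratic loss. It only illustrates the formulas above and is not PuzzleLib's internal implementation; the loss and the values of \gamma and \eta are chosen arbitrarily:

import numpy as np

def gradJ(theta):
    # Toy quadratic loss J(theta) = 0.5 * ||theta||^2, so its gradient is theta itself
    return theta

gamma, eta = 0.9, 0.1          # momentum rate and learning rate (illustrative values)
theta = np.array([1.0, -2.0])  # parameters theta_t
g = np.zeros_like(theta)       # accumulated update vector g_{t-1}

for _ in range(100):
    lookahead = theta - gamma * g                         # point "ahead" of the inertial shift
    g = gamma * g + (1 - gamma) * eta * gradJ(lookahead)  # g_t = gamma * g_{t-1} + (1 - gamma) * eta * grad J
    theta = theta - g                                      # theta_{t+1} = theta_t - g_t

print(theta)  # approaches the minimum at the origin

The only difference from plain momentum is the point at which the gradient is evaluated: theta - gamma * g instead of theta.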
Initializing¶
def __init__(self, learnRate=1e-3, momRate=0.9, nodeinfo=None):
Parameters
Parameter | Allowed types | Description | Default |
---|---|---|---|
learnRate | float | Learning rate | 1e-3 |
momRate | float | Exponential decay rate of the momentum (\gamma in the formulas above) | 0.9 |
nodeinfo | NodeInfo | Object containing information about the computational node | None |
Explanations
-
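A minimal instantiation sketch (the explicit values below are arbitrary); momRate corresponds to \gamma and learnRate to \eta in the formulas above:

from PuzzleLib.Optimizers import NesterovSGD

optimizer = NesterovSGD()                              # defaults: learnRate=1e-3, momRate=0.9
optimizer = NesterovSGD(learnRate=1e-2, momRate=0.95)  # explicit learning and momentum rates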
Examples¶
Necessary imports:
import numpy as np
from PuzzleLib.Optimizers import NesterovSGD
from PuzzleLib.Backend import gpuarray
Info
gpuarray is required to properly place the tensor on the GPU.
data = gpuarray.to_gpu(np.random.randn(16, 128).astype(np.float32))
target = gpuarray.to_gpu(np.random.randn(16, 1).astype(np.float32))
Declaring the optimizer:
optimizer = NesterovSGD(learnRate=0.01, momRate=0.85)
Suppose some net network has already been defined, for example through Graph; then, to set the optimizer up on the network, we do the following:
optimizer.setupOn(net, useGlobalState=True)
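For completeness, here is a hypothetical sketch of how such a network could be assembled before calling setupOn as above. It uses the Sequential container and the Linear/Activation modules instead of Graph purely for brevity, and the layer sizes are arbitrary, so treat it as an illustration rather than a canonical recipe:

from PuzzleLib.Containers import Sequential
from PuzzleLib.Modules import Linear, Activation, relu

net = Sequential()             # simple feed-forward stack matching the 128-feature data above
net.append(Linear(128, 64))
net.append(Activation(relu))
net.append(Linear(64, 1))      # single output to match the target shape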
Info
You can read more about optimizer methods and their parameters in the description of the Optimizer parent class.
Moreover, let there be some loss error function, inherited from Cost, that also computes its gradient. Then the optimization process is implemented as follows:
for i in range(100):
    predictions = net(data)                  # forward pass
    error, grad = loss(predictions, target)  # loss value and its gradient

    optimizer.zeroGradParams()               # reset accumulated gradients
    net.backward(grad)                       # backward pass
    optimizer.update()                       # apply the Nesterov update

    if (i + 1) % 5 == 0:
        print("Iteration #%d error: %s" % (i + 1, error))