



Parent class: SGD

Derived classes: -

This module implements the modification of stochastic gradient descent proposed by Yu. E. Nesterov.

In essence, this method is an improved version of the inertial method, also known as momentum stochastic gradient descent. Nesterov adds to the inertial method an idea well known in computational mathematics: looking ahead along the update vector.

Recall the formula for the inertial method (for an explanation of the parameters, see the documentation for MomentumSGD):

\begin{equation} g_t = \gamma{g_{t-1}} + (1 - \gamma)\eta\nabla_\theta J(\theta_t) \end{equation} \begin{equation} \theta_{t+1} = \theta_t - g_t \end{equation}

The idea is that, since we are going to shift by \gamma{g_{t-1}} anyway, it makes sense to compute the gradient of the loss function not at \theta_t but at \theta_t - \gamma{g_{t-1}}:

\begin{equation} g_t = \gamma{g_{t-1}} + (1 - \gamma)\eta\nabla_\theta J(\theta_t - \gamma{g_{t-1}}) \end{equation} \begin{equation} \theta_{t+1} = \theta_t - g_t \end{equation}

This change lets the method "move" faster when the derivative increases in the direction we are heading and slower when it decreases.
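The two update rules above can be sketched in plain NumPy to make the difference concrete. This is only an illustration of the formulas, not PuzzleLib's implementation: the helper names `momentum_step`, `nesterov_step` and `grad_fn`, as well as the toy quadratic objective, are assumptions introduced for this example.

```python
import numpy as np

def momentum_step(theta, g_prev, grad_fn, lr=1e-3, mom=0.9):
    """One step of the inertial (momentum) method: gradient taken at theta."""
    g = mom * g_prev + (1.0 - mom) * lr * grad_fn(theta)
    return theta - g, g

def nesterov_step(theta, g_prev, grad_fn, lr=1e-3, mom=0.9):
    """One Nesterov step: gradient taken at the look-ahead point."""
    lookahead = theta - mom * g_prev
    g = mom * g_prev + (1.0 - mom) * lr * grad_fn(lookahead)
    return theta - g, g

# Hypothetical toy objective J(theta) = 0.5 * ||theta||^2, whose gradient is theta
def grad_fn(theta):
    return theta

theta = np.array([1.0, -2.0])
g = np.zeros_like(theta)
for _ in range(200):
    theta, g = nesterov_step(theta, g, grad_fn, lr=0.5, mom=0.9)

print(np.linalg.norm(theta))  # distance to the minimum at the origin
```

The only difference between the two helpers is the point at which `grad_fn` is evaluated. On the very first step, when the accumulated update is zero, the look-ahead point coincides with the current parameters, so both rules produce the same result.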

"Looking ahead" can backfire if \gamma and \eta are too large: we look so far ahead that we miss regions where the gradient has the opposite sign.


def __init__(self, learnRate=1e-3, momRate=0.9, nodeinfo=None):


| Parameter | Allowed types | Description | Default |
|-----------|---------------|-------------|---------|
| learnRate | float | Learning rate | 1e-3 |
| momRate | float | Exponential decay rate | 0.9 |
| nodeinfo | NodeInfo | Object containing information about the computational node | None |




Necessary imports:

import numpy as np
from PuzzleLib.Optimizers import NesterovSGD
from PuzzleLib.Backend import gpuarray


gpuarray is required to place the tensors on the GPU correctly.


data = gpuarray.to_gpu(np.random.randn(16, 128).astype(np.float32))
target = gpuarray.to_gpu(np.random.randn(16, 1).astype(np.float32))

Declaring the optimizer:

optimizer = NesterovSGD(learnRate=0.01, momRate=0.85)

Suppose a network net has already been defined, for example through Graph. To attach the optimizer to the network, call:

optimizer.setupOn(net, useGlobalState=True)


You can read more about the optimizer methods and their parameters in the description of the Optimizer parent class.

Also suppose there is a loss function loss, inherited from Cost, that computes the error together with its gradient. The optimization process then looks like this:

for i in range(100):
    # Forward pass and loss evaluation
    predictions = net(data)
    error, grad = loss(predictions, target)

    # Reset the parameter gradients, backpropagate, then apply the update
    optimizer.zeroGradParams()
    net.backward(grad)
    optimizer.update()

    if (i + 1) % 5 == 0:
        print("Iteration #%d error: %s" % (i + 1, error))