Parent class: Optimizer

Derived classes: MomentumSGD, NesterovSGD

This module implements stochastic gradient descent (SGD).

Let us assume there is a model determined by a set of parameters \theta (in the case of neural networks, the model weights), and a training sample consisting of l + 1 object-response pairs {(x_0, y_0), ..., (x_l, y_l)}. We also need a target (loss) function, written in general form as J(\theta), or J_i(\theta) for the i-th dataset object.

The optimization task is then formulated as follows: $$ \sum_{i=0}^{l} J_i(\theta) \to \min_{\theta} $$
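As a concrete illustration of this objective, the sketch below evaluates the summed per-object loss of a linear model over a small synthetic dataset (the model, loss, and data here are illustrative assumptions, not part of this module):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset of l + 1 = 8 object-response pairs (x_i, y_i).
X = rng.standard_normal((8, 3))
y = rng.standard_normal(8)

theta = np.zeros(3)  # model parameters


def J_i(theta, i):
    """Per-object loss J_i(theta): squared error of a linear model."""
    return (X[i] @ theta - y[i]) ** 2


# The quantity being minimized: the sum of J_i(theta) over all objects.
total = sum(J_i(theta, i) for i in range(len(X)))
print(total)
```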

Gradient descent then updates the parameters according to the rule: $$ \theta_{t+1} = \theta_t - \eta \cdot \frac{1}{l+1}\sum_{i=0}^{l} \nabla_{\theta}{J_i(\theta_t)} $$


where:

\theta_{t+1} - updated set of parameters for the next optimization step;
\theta_t - set of parameters at the current step;
\eta - learning rate;
\nabla_{\theta}J_i(\theta_t) - gradient of the loss function for the i-th object of the training sample.

With a suitable learning rate, this approach converges to the global minimum for convex loss functions and to a local minimum for non-convex ones. However, as the formula shows, a single parameter update requires computing the gradient over the entire training dataset, which can be very time- and resource-consuming.
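The full-batch update above can be sketched in plain NumPy for a convex least-squares loss; the model, data, and step count are illustrative assumptions, not part of PuzzleLib:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((64, 4))
true_theta = rng.standard_normal(4)
y = X @ true_theta  # noiseless targets, so the global minimum is true_theta

theta = np.zeros(4)
eta = 0.05  # learning rate

for _ in range(500):
    # Mean gradient of J_i(theta) = (x_i . theta - y_i)^2 over the whole dataset.
    residual = X @ theta - y
    grad = 2.0 * X.T @ residual / len(X)
    # theta_{t+1} = theta_t - eta * (mean gradient)
    theta -= eta * grad

# For this convex loss the iterates approach the global minimum true_theta.
```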

Stochastic gradient descent addresses this problem: the gradient step is performed either on single randomly sampled objects (the parameters are updated after computing the gradient for one object), or on mini-batches, i.e. small subsets of the training dataset.

Let n be the size of the mini-batch; the parameter update then becomes:

\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{n}\sum_{i=j}^{j + n - 1} \nabla_{\theta}{J_i(\theta_t)}


j - index of the first element of a randomly chosen mini-batch, j <= l - n
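A minimal mini-batch SGD sketch under the same illustrative least-squares setup, sampling the start index j uniformly so the batch stays inside the dataset (all names here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((64, 4))
true_theta = rng.standard_normal(4)
y = X @ true_theta  # noiseless targets, so the minimizer is true_theta

theta = np.zeros(4)
eta, n = 0.05, 8  # learning rate and mini-batch size

for _ in range(2000):
    # Random start index j such that the n-element slice stays in range.
    j = rng.integers(0, len(X) - n + 1)
    Xb, yb = X[j:j + n], y[j:j + n]
    # Mean gradient over the mini-batch only.
    grad = 2.0 * Xb.T @ (Xb @ theta - yb) / n
    theta -= eta * grad
```

Each step touches only n objects instead of the whole dataset, trading gradient noise for much cheaper updates.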


def __init__(self, learnRate=1e-3, nodeinfo=None):


Parameter   Allowed types   Description                                                  Default
learnRate   float           Learning rate                                                1e-3
nodeinfo    NodeInfo        Object containing information about the computational node   None




Necessary imports:

>>> import numpy as np
>>> from PuzzleLib.Optimizers import SGD
>>> from PuzzleLib.Backend import gpuarray


gpuarray is required to place the tensors on the GPU properly.

Let us set up a synthetic training dataset:

>>> data = gpuarray.to_gpu(np.random.randn(16, 128).astype(np.float32))
>>> target = gpuarray.to_gpu(np.random.randn(16, 1).astype(np.float32))

Declaring the optimizer:

>>> optimizer = SGD(learnRate=0.01)

Suppose a network net is already defined, for example through Graph. Then, to install the optimizer on the network, the following is required:

>>> optimizer.setupOn(net, useGlobalState=True)


You can read more about optimizer methods and their parameters in the description of the Optimizer parent class.

Moreover, suppose there is a loss function loss, inherited from Cost, which also computes the gradient of the error. Then the optimization process is implemented as follows:

>>> for i in range(100):
...     predictions = net(data)
...     error, grad = loss(predictions, target)
...
...     optimizer.zeroGradParams()
...     net.backward(grad)
...     optimizer.update()
...
...     if (i + 1) % 5 == 0:
...         print("Iteration #%d error: %s" % (i + 1, error))