Anatomy of a PuzzleLib module: the linear layer as an example


In this tutorial, we will learn how to write a new layer (module) for the PuzzleLib library. We will analyze in detail how the linear module (Linear) is implemented.

Mandatory methods

Module is the parent class for all modules. To implement a new layer, inherit from Module (or one of its subclasses) and implement the following methods:

  1. __init__ – constructor. It initializes internal variables;
  2. updateData – method that performs inference, i.e. forward propagation;
  3. updateGrad – method that calculates the gradient on data (backprop for data);
  4. accGradParams – method that calculates the gradient for module parameters (backprop for parameters);
  5. dataShapeFrom – method that takes the shape of the input data and calculates the shape of the output data. For example, all activation modules leave the shape unchanged, whereas MaxPool with stride and without padding shrinks each feature map by a factor of several;
  6. checkDataShape – method that validates the size of the input data;
  7. gradShapeFrom – counterpart of dataShapeFrom for the backward pass: it calculates the shape of the input gradient from the shape of the output gradient;
  8. checkGradShape – method that validates the gradient size.
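
To see how the eight methods fit together, here is a minimal NumPy sketch. ModuleSketch is a hypothetical stand-in written for this illustration, not PuzzleLib's Module class, and Scale is a toy layer that multiplies its input by one scalar weight. The forward and backward wrappers, and the way scale and momentum are combined, are assumptions made for the sketch.

```python
import numpy as np


class ModuleSketch:
    """Hypothetical stand-in for a base class; NOT PuzzleLib's Module."""

    def __init__(self, name=None):
        self.name = name
        self.inData = self.data = self.grad = None

    def forward(self, data):
        self.checkDataShape(data.shape)
        self.inData = data
        self.updateData(data)
        return self.data

    def backward(self, grad):
        self.checkGradShape(grad.shape)
        self.updateGrad(grad)
        self.accGradParams(grad)
        return self.grad


class Scale(ModuleSketch):
    """Toy layer y = w * x with a single scalar parameter w."""

    def __init__(self, w=2.0, name=None):
        super().__init__(name)
        self.w = np.float32(w)
        self.wgrad = np.float32(0.0)

    def updateData(self, data):
        self.data = self.w * data                      # forward pass

    def updateGrad(self, grad):
        self.grad = self.w * grad                      # gradient on data

    def accGradParams(self, grad, scale=1.0, momentum=0.0):
        # Gradient on the parameter, blended with the previous value (assumed convention).
        self.wgrad = scale * np.sum(self.inData * grad) + momentum * self.wgrad

    def dataShapeFrom(self, shape):
        return shape                                   # elementwise: shape is unchanged

    def checkDataShape(self, shape):
        if len(shape) != 2:
            raise ValueError("Data must be 2d matrix")

    def gradShapeFrom(self, shape):
        return shape

    def checkGradShape(self, shape):
        if len(shape) != 2:
            raise ValueError("Grad must be 2d matrix")


layer = Scale(w=2.0)
x = np.ones((3, 4), dtype=np.float32)
y = layer.forward(x)
dx = layer.backward(np.ones_like(y))
```

The Linear module discussed below follows the same division of labor, with matrix products in place of the scalar multiplication.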

In-depth theory

To get a better understanding of the updateGrad and accGradParams methods, we recommend that you first study the tutorial The basics - automatic differentiation.


Now let us look in more detail at what happens in the constructor, __init__. The constructor parameters are described in the documentation for the Linear module.

    def __init__(self, insize, outsize, wscale=1.0, useBias=True, initscheme=None, name=None,
                 empty=False, transpose=False):

First, we call the parent constructor:

        super().__init__(name)

Next, we call the registerBlueprint method, which ensures that the Linear module supports serialization via Blueprint:

        self.registerBlueprint(locals())

Storing layer parameters in the internal variables:

        self.transpose = transpose
        self.useBias = useBias

If the empty flag is True, we return at this point, as there is no need to fill the layer with any initial values:

        if empty:
            return

Next, we determine the dimensions of the weight matrix and the bias vector. If the transpose flag is True, the dimensions are reversed:

        if transpose:
            wshape = (outsize, insize)
            bshape = (insize, )
        else:
            wshape = (insize, outsize)
            bshape = (outsize, )

The next step is to initialize the weight matrix with random values, according to the scheme specified by initscheme. By default, the "xavier" scheme is used:

        self.W = None
        if initscheme == "none":
            # Allocate the weights without filling them
            self.setVar("W", Variable(gpuarray.empty(wshape, dtype=np.float32, allocator=memPool)))
        else:
            if initscheme == "xavier" or initscheme is None:
                if transpose:
                    nwscale = wscale / math.sqrt(outsize)
                else:
                    nwscale = wscale / math.sqrt(insize)
                W = np.random.uniform(-nwscale, nwscale, wshape).astype(np.float32)
            elif initscheme == "he":
                if transpose:
                    nwscale = wscale * math.sqrt(2.0 / outsize)
                else:
                    nwscale = wscale * math.sqrt(2.0 / insize)
                W = np.random.normal(0.0, nwscale, wshape).astype(np.float32)
            elif initscheme == "gaussian":
                nwscale = wscale
                W = np.random.normal(0.0, nwscale, wshape).astype(np.float32)
            elif initscheme == "uniform":
                nwscale = wscale
                W = np.random.uniform(-nwscale, nwscale, wshape).astype(np.float32)
            else:
                raise ValueError("Unsupported init scheme")

            # Copy the host-side weights to the GPU
            self.setVar("W", Variable(gpuarray.to_gpu(W, allocator=memPool)))


The following initialization schemes are supported: "xavier", "he", "gaussian", "uniform", "none". There are good articles about them on the Internet, for instance on Xavier initialization.
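
The random schemes can be tried out on the CPU before touching the GPU. The sketch below mirrors the formulas from the constructor in plain NumPy; initWeights and its fanIn argument are hypothetical names introduced here, and the GPU-only "none" scheme is left out.

```python
import math
import numpy as np


def initWeights(scheme, wshape, fanIn, wscale=1.0, rng=None):
    """Return a float32 weight matrix; fanIn is insize (or outsize when transposed)."""
    rng = np.random.default_rng() if rng is None else rng

    if scheme in ("xavier", None):
        bound = wscale / math.sqrt(fanIn)
        W = rng.uniform(-bound, bound, wshape)
    elif scheme == "he":
        W = rng.normal(0.0, wscale * math.sqrt(2.0 / fanIn), wshape)
    elif scheme == "gaussian":
        W = rng.normal(0.0, wscale, wshape)
    elif scheme == "uniform":
        W = rng.uniform(-wscale, wscale, wshape)
    else:
        raise ValueError("Unsupported init scheme")

    return W.astype(np.float32)


insize, outsize = 400, 50
rng = np.random.default_rng(0)

Wx = initWeights("xavier", (insize, outsize), insize, rng=rng)
Wh = initWeights("he", (insize, outsize), insize, rng=rng)

# Xavier: all values stay within the uniform bound wscale / sqrt(insize).
print(np.abs(Wx).max() <= np.float32(1.0 / math.sqrt(insize)))
# He: the sample standard deviation is close to sqrt(2 / insize).
print(Wh.std(), math.sqrt(2.0 / insize))
```

Both schemes shrink the weights as the number of inputs grows, which keeps the scale of the layer output roughly independent of insize.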

Finally, we initialize the biases:

        self.b = None
        if useBias:
            self.setVar("b", Variable(gpuarray.zeros(bshape, dtype=np.float32, allocator=memPool)))


In general, what happens inside the constructor depends on the module, but the usual pattern is to store the internal variables and initialize the weights.


updateData implements forward propagation, i.e. inference.

It is a very short method for the Linear module:

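    def updateData(self, data):
        if not self.transpose:
            self.data = CuBlas.mulMatrixOnMatrix(data, self.W)
        else:
            self.data = CuBlas.mulMatrixOnMatrix(data, self.W, transpB=True)

        if self.useBias:
            # Add the bias vector to every row of the result
            MatVec.addVecToMat(self.b, self.data, axis=1, out=self.data)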

It uses CuBlas, the linear algebra library in CUDA, so it is quite simple. The input data matrix and the weight matrix are multiplied, after which the bias vector is added to the result. When you implement your own module, this would be where you need to do all the math for inference.


It is not necessary to implement all the calculations on the GPU right away. You might first build a prototype in numpy and then rewrite the calculations for the GPU.
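
As an illustration of such a prototype, here is a NumPy sketch of the forward pass of a linear layer. The function name linearForward is hypothetical; the arithmetic follows the description above (matrix product, then bias added to every row).

```python
import numpy as np


def linearForward(data, W, b=None, transpose=False):
    """NumPy prototype of the forward pass: data @ W (or data @ W.T), plus bias."""
    out = data @ (W.T if transpose else W)
    if b is not None:
        out = out + b[np.newaxis, :]
    return out


data = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
W = np.array([[1.0, 0.0, -1.0], [0.5, 1.0, 2.0]], dtype=np.float32)   # (insize=2, outsize=3)
b = np.array([0.1, 0.2, 0.3], dtype=np.float32)

out = linearForward(data, W, b)
outT = linearForward(data, W.T.copy(), b, transpose=True)   # same weights, stored transposed

print(out)
```

A prototype like this can later double as the CPU reference in the unit tests.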


updateGrad calculates the gradient on data. As with updateData, CuBlas is used here:

    def updateGrad(self, grad):
        if not self.transpose:
            self.grad = CuBlas.mulMatrixOnMatrix(grad, self.W, transpB=True)
        else:
            self.grad = CuBlas.mulMatrixOnMatrix(grad, self.W)

The gradient vector is multiplied by the transposed weight matrix. See The basics - automatic differentiation to understand why this is the case.
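
The formula can also be checked numerically. The sketch below compares grad @ W.T with central finite differences of a scalar loss chosen so that its derivative with respect to the layer output equals grad. The helper names linearBackwardData and loss are hypothetical, and float64 is used to keep the numerical derivative accurate.

```python
import numpy as np


def linearBackwardData(grad, W, transpose=False):
    """Gradient on the layer input: grad @ W.T (or grad @ W if W is stored transposed)."""
    return grad @ (W if transpose else W.T)


rng = np.random.default_rng(0)
data = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
grad = rng.normal(size=(4, 2))      # stands in for dL/d(output)


def loss(x):
    # Scalar loss L = sum(output * grad), so that dL/d(output) == grad.
    return np.sum((x @ W) * grad)


# Central finite differences, one input element at a time.
eps = 1e-6
numeric = np.zeros_like(data)
for idx in np.ndindex(*data.shape):
    step = np.zeros_like(data)
    step[idx] = eps
    numeric[idx] = (loss(data + step) - loss(data - step)) / (2 * eps)

analytic = linearBackwardData(grad, W)
print(np.allclose(analytic, numeric, atol=1e-6))
```

The same finite-difference trick works for any new module whose backward pass you want to sanity-check.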


accGradParams calculates the gradient for module parameters (backprop for parameters). In the case of a linear layer, this is literally multiplying the transposed input matrix by the gradient on the right:

    def accGradParams(self, grad, scale=1.0, momentum=0.0):
        if not self.transpose:
            CuBlas.mulMatrixOnMatrix(self.inData, grad, out=self.vars["W"].grad, transpA=True,
                                     alpha=scale, beta=momentum)
        else:
            CuBlas.mulMatrixOnMatrix(grad, self.inData, out=self.vars["W"].grad, transpA=True,
                                     alpha=scale, beta=momentum)

When implementing your module, you will need to figure out how to calculate the gradient on the layer weights, implement it in numpy first, and only then in CUDA/OpenCL.

Different optimization algorithms make use of the scale and momentum parameters. Plain SGD uses neither: it does not rescale the gradient and does not accumulate gradients from past iterations via momentum.

Next, if required, the gradient on the bias is calculated. This is merely summing the gradient over the batch, column by column:

        if self.useBias:
            CuBlas.sumOnMatrix(grad, out=self.vars["b"].grad, alpha=scale, beta=momentum)
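
The role of scale and momentum is easier to see in NumPy. The sketch below assumes that alpha and beta follow the usual BLAS convention, out = alpha * (A @ B) + beta * out; this is an assumption about the CuBlas wrappers, not something stated in this tutorial, and accGradParamsSketch is a hypothetical helper.

```python
import numpy as np


def accGradParamsSketch(inData, grad, wgrad, bgrad, scale=1.0, momentum=0.0):
    """Update wgrad and bgrad in place: new = scale * fresh + momentum * old."""
    wgrad[...] = scale * (inData.T @ grad) + momentum * wgrad
    bgrad[...] = scale * grad.sum(axis=0) + momentum * bgrad


inData = np.array([[1.0, 2.0], [3.0, 4.0]])     # (batch=2, insize=2)
grad = np.array([[1.0], [1.0]])                 # (batch=2, outsize=1)

wgrad = np.ones((2, 1))                         # pretend this is left over from a previous step
bgrad = np.ones(1)

accGradParamsSketch(inData, grad, wgrad, bgrad, scale=0.5, momentum=0.9)
print(wgrad.ravel(), bgrad)
```

With the defaults scale=1.0 and momentum=0.0, the previous gradient is simply overwritten by the fresh one.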


dataShapeFrom is a very simple method: given the shape of the input data, it calculates the shape of the output data and returns it.

    def dataShapeFrom(self, shape):
        if not self.transpose:
            return shape[0], self.W.shape[1]
        else:
            return shape[0], self.W.shape[0]


checkDataShape is another simple method. It receives a shape and checks whether it is valid for the input data.

    def checkDataShape(self, shape):
        if len(shape) != 2:
            raise ValueError("Data must be 2d matrix")

        if not self.transpose:
            if shape[1] != self.W.shape[0]:
                raise ValueError("Expected %d data dimensions, %d were given" % (self.W.shape[0], shape[1]))
        else:
            if shape[1] != self.W.shape[1]:
                raise ValueError("Expected %d data dimensions, %d were given" % (self.W.shape[1], shape[1]))


gradShapeFrom is the reverse of dataShapeFrom: from the shape of the output gradient, it calculates the shape of the gradient that will arrive at the input during backprop.

    def gradShapeFrom(self, shape):
        if not self.transpose:
            return shape[0], self.W.shape[0]
        else:
            return shape[0], self.W.shape[1]


checkGradShape is similar to checkDataShape: it checks whether the gradient shape is correct.

    def checkGradShape(self, shape):
        if len(shape) != 2:
            raise ValueError("Grad must be 2d matrix")

        if not self.transpose:
            if shape[1] != self.W.shape[1]:
                raise ValueError("Expected %d grad dimensions, %d were given" % (self.W.shape[1], shape[1]))
        else:
            if shape[1] != self.W.shape[0]:
                raise ValueError("Expected %d grad dimensions, %d were given" % (self.W.shape[0], shape[1]))
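
The four shape methods are consistent with each other: pushing a valid input shape through dataShapeFrom and then through gradShapeFrom gives the input shape back. The sketch below shows this with free functions that take the weight shape explicitly; these are hypothetical rewrites for illustration, not the module's methods.

```python
def dataShapeFrom(shape, wshape, transpose=False):
    """Output shape for a given input shape."""
    return shape[0], (wshape[0] if transpose else wshape[1])


def gradShapeFrom(shape, wshape, transpose=False):
    """Input-gradient shape for a given output-gradient shape."""
    return shape[0], (wshape[1] if transpose else wshape[0])


def checkDataShape(shape, wshape, transpose=False):
    if len(shape) != 2:
        raise ValueError("Data must be 2d matrix")
    expected = wshape[1] if transpose else wshape[0]
    if shape[1] != expected:
        raise ValueError("Expected %d data dimensions, %d were given" % (expected, shape[1]))


wshape = (5, 3)                 # insize=5, outsize=3, not transposed
inShape = (32, 5)

checkDataShape(inShape, wshape)
outShape = dataShapeFrom(inShape, wshape)
backShape = gradShapeFrom(outShape, wshape)
print(outShape, backShape)      # (32, 3) (32, 5)
```

A round trip like this is a cheap extra assertion to put into the unit tests of a new module.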

Unit tests

It is vital to write unit tests: in them, the calculations are implemented on the CPU and the results are compared with those computed on the GPU. We strongly recommend checking the calculations on random tensors of random sizes.
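
A randomized test of this kind can be sketched as follows. Since no GPU is assumed here, a vectorized NumPy function plays the role of the implementation under test, and an explicit loop serves as the easy-to-audit reference. All function names are hypothetical, and the tolerance of 1e-4 is an arbitrary choice for float32 data.

```python
import numpy as np


def referenceForward(data, W, b):
    """Straightforward reference in float64: explicit loops, easy to audit."""
    data, W, b = data.astype(np.float64), W.astype(np.float64), b.astype(np.float64)
    out = np.zeros((data.shape[0], W.shape[1]))
    for i in range(data.shape[0]):
        for j in range(W.shape[1]):
            out[i, j] = np.dot(data[i], W[:, j]) + b[j]
    return out


def candidateForward(data, W, b):
    """Implementation under test (vectorized NumPy here instead of a GPU kernel)."""
    return data @ W + b[np.newaxis, :]


def randomizedTest(trials=20, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        # Random sizes between 1 and 16 for batch, insize and outsize.
        batch, insize, outsize = (int(v) for v in rng.integers(1, 17, size=3))
        data = rng.normal(size=(batch, insize)).astype(np.float32)
        W = rng.normal(size=(insize, outsize)).astype(np.float32)
        b = rng.normal(size=outsize).astype(np.float32)

        assert np.allclose(referenceForward(data, W, b), candidateForward(data, W, b), atol=1e-4)
    return trials


print(randomizedTest())
```

Random sizes tend to expose indexing and stride mistakes that a single fixed shape would hide.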

Make sure to test forward and backward calculations. In Linear this is done by the calcTest and trainTest functions.


In calcTest, we need some "fake" training data to check the correctness of the calculations: data and target (i.e. data and labels). We generate them and send them to the GPU:

def calcTest():
    insize = 5
    outsize = 1

    data = gpuarray.to_gpu(np.random.normal(0.0, 0.01, (5, insize)).astype(np.float32))
    target = gpuarray.to_gpu(np.random.normal(0.0, 0.01, (5, outsize)).astype(np.float32))

After that, we build a linear layer of the desired size and import a loss function that will compare the result of the calculations with target:

    linear = Linear(insize, outsize)

    from PuzzleLib.Cost.MSE import MSE
    mse = MSE()

We pass the data through the layer:

    linear(data)

Calculating the error and gradient:

    error, grad = mse(linear.data, target)

Running the backward pass:

    linear.backward(grad)

Now we reproduce the forward and backward passes on the CPU with numpy:

    hostOutData = np.dot(data.get(), linear.W.get()) + linear.b.get()[np.newaxis, :]
    hostInGrad = np.dot(grad.get(), linear.W.get().T)

    hostWGrad = np.dot(data.get().T, grad.get())
    hostBGrad = np.sum(grad.get(), axis=0)

Comparing the results of GPU and CPU calculations via the np.allclose function, which allows a small margin of error in calculations:

    assert np.allclose(hostOutData, linear.data.get())
    assert np.allclose(hostInGrad, linear.grad.get())
    assert np.allclose(hostWGrad, linear.vars["W"].grad.get())
    assert np.allclose(hostBGrad, linear.vars["b"].grad.get())

If anything had gone wrong, one of the asserts would have failed.
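
For reference, np.allclose(a, b) accepts a pair of arrays when |a - b| <= atol + rtol * |b| holds elementwise, with defaults rtol=1e-5 and atol=1e-8. The short example below shows how the verdict changes with the tolerance; the numbers are made up for illustration.

```python
import numpy as np

a = np.array([1.0, 100.0, 0.0])
b = a + np.array([5e-6, 5e-4, 5e-9])      # small perturbations

# Default check: |a - b| <= atol + rtol * |b| with rtol=1e-5, atol=1e-8.
closeDefault = np.allclose(a, b)
closeStrict = np.allclose(a, b, rtol=1e-7)

print(closeDefault)     # True
print(closeStrict)      # False: the relative tolerance is now too tight
```

For float32 results coming from a GPU, the default tolerances may be too strict for large matrices, in which case rtol and atol can be loosened explicitly.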


Other, more complex modules may require more tests with more diverse data; see the convolutional layer for an example.


In trainTest, we check that backprop works: after repeated training steps, the error should decrease.

def trainTest():
    insize = 500
    outsize = 100

    data = gpuarray.to_gpu(np.random.normal(0.0, 1.0, (32, insize)).astype(np.float32))
    target = gpuarray.to_gpu(np.random.normal(0.0, 1.0, (32, outsize)).astype(np.float32))

    linear = Linear(insize, outsize)

    from PuzzleLib.Cost.MSE import MSE
    mse = MSE()

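    for i in range(100):
        learnRate = 1e-4

        # Forward pass, then error and gradient from the loss
        linear(data)
        error, grad = mse(linear.data, target)

        # Backward pass and parameter update
        linear.backward(grad)
        linear.updateParams(learnRate)

        if (i+1) % 5 == 0:
            print("Iteration #%d error: %s" % (i+1, error))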
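
The same experiment can be prototyped on the CPU to get a feeling for what a healthy run looks like. The sketch below is a plain NumPy gradient-descent loop, not the PuzzleLib code: the error is defined here as half the squared error averaged over the batch, the update uses the ordinary "subtract the gradient" convention, and the learning rate of 1e-2 was picked for this sketch so that the decrease is clearly visible.

```python
import numpy as np

rng = np.random.default_rng(0)
insize, outsize, batch = 500, 100, 32

data = rng.normal(0.0, 1.0, (batch, insize)).astype(np.float32)
target = rng.normal(0.0, 1.0, (batch, outsize)).astype(np.float32)

bound = 1.0 / np.sqrt(insize)
W = rng.uniform(-bound, bound, (insize, outsize)).astype(np.float32)
b = np.zeros(outsize, dtype=np.float32)

learnRate = 1e-2
errors = []

for i in range(100):
    out = data @ W + b                          # forward pass
    diff = out - target
    errors.append(float(0.5 * np.sum(diff ** 2) / batch))

    grad = diff / batch                         # dL/d(out) for the error above
    W -= learnRate * (data.T @ grad)            # step along the gradient on W
    b -= learnRate * grad.sum(axis=0)           # step along the gradient on b

    if (i + 1) % 20 == 0:
        print("Iteration #%d error: %s" % (i + 1, errors[-1]))
```

If the printed error does not go down in a setup like this, the backward pass is the first place to look.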


That is all there is to it. Study how other, more complex modules are implemented, and it will be easier for you to implement your own new modules as well.