Network optimization

Introduction

In this tutorial, we will learn how the PuzzleLib library optimizes a neural network by fixing the dimensions of its input tensor. This kind of acceleration works for both training and inference. The script, OptimizeNet.py, is located in the TestLib folder of the PuzzleLib library. We will measure the training time on one batch for a non-optimized VGG-16 and for an optimized one, and check whether the optimization actually speeds up training.

Optimization

There are a few other ways to optimize a network. At the same cost, they can improve network speed with minimal or no loss in quality, but they apply only to inference:

  • Neural network graph optimization (fusion) - the most popular neural network architectures use a standardized set of modules (Conv, MaxPool, Activation, BatchNorm, etc.). Since the architecture is known once training is complete, various combinations of adjacent modules can be merged and optimized for inference;
  • Converting data for computing in neural networks to half precision - data for computing in neural networks in Python usually has the float32 type: each number is stored as a floating-point value that occupies 4 bytes of memory. That much precision is often unnecessary. At the same time, multiplying large arrays of float32 elements is a rather time-consuming operation. Converting the data from float32 to float16 reduces the amount of computation, usually with little or no loss in inference accuracy. Each float16 number occupies 2 bytes of memory. In the best case, this approach makes inference approximately 2 times faster;
  • Quantization of data for computing in neural networks - similar to the previous point, but the data is converted from float32 to int8. This causes minor losses in inference accuracy. Each int8 number occupies only 1 byte of memory. In the best case, inference with this approach is approximately 4 times faster.
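
The memory figures in the last two points can be checked with a short numpy sketch. It is only an illustration on the CPU: the symmetric per-tensor int8 scheme and the names x32, x16, x8 and scale are assumptions made for this example and are not taken from PuzzleLib or TensorRT.

```python
import numpy as np

np.random.seed(0)
x32 = np.random.normal(size=(1000,)).astype(np.float32)

# Half precision: the same values stored in 2 bytes each
x16 = x32.astype(np.float16)

# Symmetric int8 quantization with one scale for the whole tensor
# (one simple scheme among many; assumed here for illustration)
scale = float(np.abs(x32).max()) / 127.0
x8 = np.clip(np.round(x32 / scale), -127, 127).astype(np.int8)
x8_restored = x8.astype(np.float32) * scale

print(x32.nbytes, x16.nbytes, x8.nbytes)  # 4000 2000 1000

# Largest round-trip error introduced by each conversion
err16 = float(np.abs(x32 - x16.astype(np.float32)).max())
err8 = float(np.abs(x32 - x8_restored).max())
print("float16 error:", err16, "int8 error:", err8)
```

The byte counts follow directly from the element sizes, while the round-trip errors depend on the data: int8 error is bounded by about half of the chosen scale.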

You can learn how to apply these approaches in the tutorial Converting to TensorRT engine.

Running the script

Open the script OptimizeNet.py located in the TestLib folder. Check that the script runs to the end without crashing. If it crashes, you need to find the problem with your PuzzleLib installation (for example, CUDA may not be working, or a Python library may be missing).

If you do not face any complications at this step, please proceed: we will go through the script contents step-by-step.

Start of the script

The script starts by importing the numpy library - the main Python library for working with multidimensional tensors.

import numpy as np

Next, we import all the required classes and functions from the PuzzleLib library. The next line imports the gpuarray backend helper, which we need to place the tensors on the GPU correctly:

from PuzzleLib.Backend import gpuarray

Next, we import the timeKernel function, which measures the execution time of a given operation. Its main parameters are the operation to be measured and looplength, the number of iterations over which the measurement is taken:

from PuzzleLib.Backend.Benchmarks import timeKernel
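
To make the role of looplength concrete, here is a minimal sketch of a timing helper written with the standard library only. The function time_call and its parameters are hypothetical, and the sketch assumes that normalizing means dividing the total time by the number of iterations; it is not the implementation of timeKernel. GPU kernel launches are asynchronous, so a real GPU benchmark also has to synchronize with the device before reading the clock, which this CPU-only sketch does not need to do.

```python
import time


def time_call(func, args=(), looplength=100, normalize=True):
    """Hypothetical helper: time `looplength` calls of func(*args)."""
    func(*args)  # warm-up call, excluded from the measurement

    start = time.perf_counter()
    for _ in range(looplength):
        func(*args)
    elapsed = time.perf_counter() - start

    # With normalize=True, report the average time of a single call
    return elapsed / looplength if normalize else elapsed


# Usage: time a cheap operation and count how often it was invoked
calls = []
per_call = time_call(calls.append, args=(1,), looplength=100)
print("calls made:", len(calls), "seconds per call:", per_call)
```

Averaging over many iterations smooths out one-off delays, which is why a single timed call is rarely a reliable measurement.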

The following line imports the loader for one of the architectures implemented in the library, VGG:

from PuzzleLib.Models.Nets.VGG import loadVGG

Here we load the modules that will train the neural network:

from PuzzleLib.Optimizers import SGD
from PuzzleLib.Cost import CrossEntropy
from PuzzleLib.Handlers import Trainer

Optimizers contains implementations of algorithms for training networks, and Cost contains implementations of loss functions (the cost of errors).

Handlers are auxiliary objects for training, validation, and standard computation of neural network outputs.

Next we go to the code of the main() function, which is called at the bottom of the script:

def main():
  net = loadVGG(None, "16")

Here we load the VGG family neural network architecture and initialize it.

The value "16" of the second parameter specifies the number of network layers. PuzzleLib implements VGG with 11, 16, or 19 layers. Any other value of this parameter results in an error.

  batchsize = 16
  size = (batchsize, 3, 224, 224)

The first line defines the batch size, that is, how many objects (images) are fed to the network at a time during training. Here we set it to 16.

In the second line, we form a tuple with the dimensions of the data fed to the neural network.

The first number of the tuple is the batch size, the second is the number of channels (feature maps) in the object (its depth), and the third and fourth are the height and width of the object, respectively.

Each object is a tensor of this size. In real conditions, its counterpart could be an image with the same dimensions (all images are represented as similar tensors).
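
As a quick sanity check of what this tuple implies, the following sketch computes how many numbers one batch holds and how much memory it takes as float32 (4 bytes per element). It uses only the standard library; the variable names elements, bytes_float32 and mib are introduced here for illustration.

```python
import math

batchsize = 16
size = (batchsize, 3, 224, 224)

# Total number of values in one batch: 16 * 3 * 224 * 224
elements = math.prod(size)

# float32 stores each value in 4 bytes
bytes_float32 = elements * 4
mib = bytes_float32 / 2**20

print(elements, bytes_float32, mib)  # 2408448 9633792 9.1875
```

So a single input batch occupies a little over 9 MiB before any intermediate activations are counted.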

This code block generates the training data:

  batch = np.random.normal(size=size).astype(dtype=np.float32)
  batch = gpuarray.to_gpu(batch)

We do not need real data to measure speed. Artificial data of the required format is useless for actual training (no real problem is solved here), but it is enough for measuring training time, and it saves us the time of searching for a suitable dataset.

We form a batch of three-dimensional tensors with the dimensions defined in the previous step, using the np.random.normal function from the numpy library.

The np.random.normal function fills a tensor of the given shape with random numbers drawn from a normal distribution.

Once the batch is formed on the CPU, the to_gpu function transfers it to the GPU. All further calculations take place on the GPU.

The class labels for each object in the batch are generated in the next block. The np.random.randint function generates the numbers (the class labels for the objects) in the range from 0 to 999 and stores them in the np.int32 format:

  labels = np.random.randint(low=0, high=1000, size=(batchsize, ), dtype=np.int32)
  labels = gpuarray.to_gpu(labels)

Thus, we have mapped each object in the batch to a class (from 0 to 999), simulating a real training sample. After generating the labels, we transfer them to the GPU in the same way as the data.
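
Before moving anything to the GPU, the synthetic data can be inspected on the CPU. The sketch below repeats the two numpy calls from the script, without the gpuarray transfer, and checks the shapes, types, and label range; the checks themselves are an addition for illustration and are not part of OptimizeNet.py.

```python
import numpy as np

batchsize = 16
size = (batchsize, 3, 224, 224)

# Same generation calls as in the script, kept on the CPU
batch = np.random.normal(size=size).astype(dtype=np.float32)
labels = np.random.randint(low=0, high=1000, size=(batchsize, ), dtype=np.int32)

# The upper bound `high` is exclusive, so labels fall in 0..999
print(batch.shape, batch.dtype)
print(labels.shape, labels.dtype, int(labels.min()), int(labels.max()))
```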

Preparing for network training

Now that we have the data and architecture, it is time to move on to the neural network training.

Let us choose stochastic gradient descent as the optimizer. The module that implements this training method is called SGD. We set it up on the initialized network; during training it will update all the network weights according to the formulas coded in this module:

  optimizer = SGD()
  optimizer.setupOn(net)
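
For readers who want to see the kind of formula involved, here is a minimal sketch of the textbook gradient descent update, w = w - learn_rate * grad, applied to a one-parameter toy problem. The function sgd_step and the learning rate of 0.1 are assumptions for this example; the sketch does not reproduce the code or defaults of the PuzzleLib SGD module. In real training the gradient is estimated from a mini-batch, which is what makes the method "stochastic"; the update rule itself is the same.

```python
def sgd_step(weights, grads, learn_rate=0.1):
    """One plain gradient descent update for a list of scalar weights."""
    return [w - learn_rate * g for w, g in zip(weights, grads)]


# Toy problem: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w = [0.0]
for _ in range(100):
    grad = [2.0 * (w[0] - 3.0)]
    w = sgd_step(w, grad)

print(w[0])  # very close to the minimum at 3.0
```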

To train a network, you still need a function that evaluates how wrong the network was when classifying an input object. Let us initialize it:

  cost = CrossEntropy(maxlabels=1000)

This is the cross-entropy function. The maxlabels parameter sets the number of network outputs. We treat the output vector as a probability distribution over 1000 classes of objects (the probability that the object in question is of class 0, ..., the probability that it is of class 999).

The correct answers for each object are known, so we can compare the network prediction with them. Cross-entropy performs this operation by producing a number that shows how wrong the network was.
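
The idea can be sketched for a single object in plain Python. The example assumes that the network produces raw scores which are turned into probabilities with softmax before the cross-entropy is taken; whether the PuzzleLib CrossEntropy module works exactly this way is not stated in this tutorial, and the function softmax_cross_entropy is a hypothetical name.

```python
import math


def softmax_cross_entropy(scores, label):
    """Textbook loss for one object: -log(softmax(scores)[label])."""
    m = max(scores)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]


# A network that cannot tell the 1000 classes apart: loss equals log(1000)
uniform_loss = softmax_cross_entropy([0.0] * 1000, label=7)

# A network that is confident about class 7
confident = [0.0] * 1000
confident[7] = 20.0
correct_loss = softmax_cross_entropy(confident, label=7)  # true class is 7
wrong_loss = softmax_cross_entropy(confident, label=8)    # true class is 8

print(uniform_loss, correct_loss, wrong_loss)
```

The loss is close to zero when the network is confident and right, and large when it is confident and wrong.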

Now we initialize the trainer.

  trainer = Trainer(net, cost, optimizer)

A trainer is an object that implements the training process, during which it calls the optimizer to recalculate the neural network weights in accordance with the cost function. You can read more about it in the documentation for the Trainer handler.

So far, we have only initialized the trainer; it does not do anything yet. To start the training process, we need to call the train method of this object, which is what we do next.

Network training

We will not set the number of epochs or provide validation and test samples. A single batch is enough for our goal: comparing the training time of the non-optimized and optimized networks.

We use the timeKernel function to track the execution time of the network training operation with its current architecture on a given amount of data. The length of the test cycle is set to 100 iterations. We also specify in the log text that this data was obtained before the network optimization:

  print("Started benchmarking %s ..." % net.name)
  timeKernel(
      trainer.train, args=(batch, labels), looplength=100, logname="Before optimizing %s" % net.name, normalize=True
  )

Next, we optimize the network for the shape of its input. The optimizeForShape method works recursively: it visits each node (module, operation) of the network's computation graph and calls itself on that node with the current data shape:

  net.optimizeForShape(size)
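
The recursive traversal can be pictured with a toy model. All classes below (Module, Pool2x2, Sequential) and the method optimize_for_shape are hypothetical stand-ins written for this illustration; they are not PuzzleLib classes and say nothing about what the library actually tunes at each node. The sketch only shows how a shape can be passed down a nested graph, with each node receiving the shape produced by the node before it.

```python
class Module:
    """Toy network node that keeps its input shape unchanged."""

    def __init__(self, name):
        self.name = name
        self.tuned_for = None

    def output_shape(self, shape):
        return shape

    def optimize_for_shape(self, shape):
        # A real backend might select a specialized kernel here
        self.tuned_for = shape
        return self.output_shape(shape)


class Pool2x2(Module):
    """Toy pooling node that halves height and width."""

    def output_shape(self, shape):
        n, c, h, w = shape
        return (n, c, h // 2, w // 2)


class Sequential(Module):
    """Toy container: recurses into its children in order."""

    def __init__(self, name, children):
        super().__init__(name)
        self.children = children

    def optimize_for_shape(self, shape):
        self.tuned_for = shape
        for child in self.children:
            shape = child.optimize_for_shape(shape)
        return shape


inner = Sequential("block", [Module("conv2"), Pool2x2("pool2")])
toy_net = Sequential("net", [Module("conv1"), Pool2x2("pool1"), inner])

out_shape = toy_net.optimize_for_shape((16, 3, 224, 224))
print(out_shape)  # (16, 3, 56, 56)
```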

After the optimization, we run timeKernel again with the same parameters and compare time. In the log we specify that the data was received after the network optimization:

  timeKernel(
      trainer.train, args=(batch, labels), looplength=100, logname="After optimizing %s" % net.name, normalize=True
  )

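If the optimization is effective, the "After optimizing" time in the log is lower than the "Before optimizing" time: the optimized network processes one batch faster, so training as a whole is faster too.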

Now you have mastered the basic skill of optimizing neural networks!