Pausing and resuming network training

Introduction

In this tutorial, we will learn how the PuzzleLib library allows you to continue training from a checkpoint. You can find the ResumeTrain.py script in the TestLib folder.

The ability to pause model training and resume it later is important, though not technically challenging. It helps you organize long training runs, save checkpoints, load any of them to continue training and, as a result, experiment safely.

Before reading further, we strongly advise that you go through the Training the MNIST classifier tutorial if you have not already done so, since this material builds on it and does not repeat points that were already covered there.

Training sample

Go to the website of Yann LeCun (creator of the MNIST dataset) and download the following files:

  • t10k-images.idx3-ubyte.gz
  • t10k-labels.idx1-ubyte.gz
  • train-images.idx3-ubyte.gz
  • train-labels.idx1-ubyte.gz

Place the downloaded files in a folder of your choice and unpack them there. The extracted files contain 70,000 black-and-white images, 28 by 28 pixels each, of handwritten digits from 0 to 9.
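If you prefer to unpack the archives programmatically, here is a minimal sketch using Python's standard gzip and shutil modules (this helper is not part of ResumeTrain.py, and dataPath is a placeholder for your own folder):

import gzip
import os
import shutil

dataPath = "../TestData/"  # placeholder: the folder where you saved the .gz archives

# Decompress each archive next to the original, keeping the name without the .gz suffix
for name in ["train-images.idx3-ubyte", "train-labels.idx1-ubyte",
             "t10k-images.idx3-ubyte", "t10k-labels.idx1-ubyte"]:
    with gzip.open(os.path.join(dataPath, name + ".gz"), "rb") as fin:
        with open(os.path.join(dataPath, name), "wb") as fout:
            shutil.copyfileobj(fin, fout)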

Checking the script

Open the ResumeTrain.py script and run it to check that it reaches the end without crashing. If it does crash, you need to sort out a problem with your PuzzleLib installation (for example, CUDA may not be working, or a Python library may be missing).

If you face no complications here, please proceed: we will go through the script contents step-by-step.

The script structure

Let us take a look at the structure of the script. Even though it is based on the Training the MNIST classifier tutorial, it still has quite a few differences.

The imports are the same:

import os

import numpy as np

from PuzzleLib.Datasets import MnistLoader

from PuzzleLib.Containers import Sequential
from PuzzleLib.Modules import Conv2D, MaxPool2D, Activation, Flatten, Linear
from PuzzleLib.Modules.Activation import relu
from PuzzleLib.Handlers import Trainer, Validator
from PuzzleLib.Optimizers import MomentumSGD
from PuzzleLib.Cost import CrossEntropy

The first significant difference is that there is now a separate function for building the network architecture - buildNet:

def buildNet():
    net = Sequential()
    net.append(Conv2D(1, 16, 3))
    net.append(MaxPool2D())
    net.append(Activation(relu))

    net.append(Conv2D(16, 32, 4))
    net.append(MaxPool2D())
    net.append(Activation(relu))

    net.append(Flatten())
    net.append(Linear(32 * 5 * 5, 1024))
    net.append(Activation(relu))

    net.append(Linear(1024, 10))

    return net
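A quick note on the 32 * 5 * 5 input size of the first Linear layer (assuming the default 2x2 pooling window): with no padding, each 28x28 image shrinks to 26x26 after the 3x3 convolution, to 13x13 after the first MaxPool2D, to 10x10 after the 4x4 convolution, and to 5x5 after the second MaxPool2D; with 32 output channels that gives 32 * 5 * 5 = 800 values per image after Flatten.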

The training process itself is also moved into a separate function, called train:

def train(net, optimizer, data, labels, epochs):
    cost = CrossEntropy(maxlabels=10)
    trainer = Trainer(net, cost, optimizer)
    validator = Validator(net, cost)

    for i in range(epochs):
        trainer.trainFromHost(data[:60000], labels[:60000], macroBatchSize=60000,
                              onMacroBatchFinish=lambda tr: print("Train error: %s" % tr.cost.getMeanError()))
        print("Accuracy: %s" % (1.0 - validator.validateFromHost(data[60000:], labels[60000:], macroBatchSize=10000)))

        optimizer.learnRate *= 0.9
        print("Reduced optimizer learn rate to %s" % optimizer.learnRate)

Function arguments:

  • net - object that represents the network in the library (Sequential in our case);
  • optimizer - network optimizer that is an object of the Optimizer class from the Optimizers family;
  • data - data tensor in np.ndarray format; in our case it is a tensor of shape (N, C, H, W), where N is the total number of images, C is the number of channels (MNIST images are black-and-white, i.e. single-channel), and H and W are the image height and width (28 and 28 respectively); a small shape sketch follows this list;
  • labels - vector of labels in np.ndarray format, of length N, one label per image;
  • epochs - number of training epochs.
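To make the expected formats concrete, here is a minimal sketch that builds synthetic arrays of the required shape (the dtypes are assumptions here; in the real script data and labels come from MnistLoader):

import numpy as np

# 70,000 single-channel 28x28 "images" and one integer class label per image
data = np.random.rand(70000, 1, 28, 28).astype(np.float32)
labels = np.random.randint(0, 10, size=(70000, )).astype(np.int32)

# With net and optimizer built as shown below, one epoch of training would be:
# train(net, optimizer, data, labels, epochs=1)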

The last function is main. Its beginning resembles the corresponding function from the aforementioned tutorial:

def main():
    path = "../TestData/"
    mnist = MnistLoader()
    data, labels = mnist.load(path=path)
    data, labels = data[:], labels[:]
    print("Loaded mnist")

    np.random.seed(1234)

Important

Please do not forget to replace the path variable with the path to the folder where you unpacked the dataset archives.

The next step is to create the network and set up an optimizer on it:

    net = buildNet()

    optimizer = MomentumSGD()
    optimizer.setupOn(net, useGlobalState=True)
    optimizer.learnRate = 0.1
    optimizer.momRate = 0.9

Now we need to execute several training epochs - we have chosen 10:

    epochs = 10
    print("Training for %s epochs ..." % epochs)
    train(net, optimizer, data, labels, epochs)

After 10 epochs, the net and optimizer objects have acquired a specific internal state (the distribution of parameter values), which we want to save so that we can use it later for further training:

    print("Saving net and optimizer ...")
    net.save(os.path.join(path, "net.hdf"))
    optimizer.save(os.path.join(path, "optimizer.hdf"))

Once we decide to proceed with training, we simply load the parameters from the files back into previously created objects of the corresponding classes and run a few more epochs:

    print("Reloading net and optimizer ...")
    net.load(os.path.join(path, "net.hdf"))
    optimizer.load(os.path.join(path, "optimizer.hdf"))

    print("Continuing training for %s epochs ..." % epochs)
    train(net, optimizer, data, labels, epochs)

Finally, at the end of the script, the checkpoint files are deleted to clean up:

    os.remove(os.path.join(path, "net.hdf"))
    os.remove(os.path.join(path, "optimizer.hdf"))
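Note that the saved checkpoints are all you need to resume training in a later session or a separate script: rebuild the same architecture with buildNet, set up the optimizer, and load both files before calling train. Here is a minimal sketch of such a helper (resumeFromCheckpoint is hypothetical and not part of ResumeTrain.py; it assumes the checkpoint files have not yet been deleted):

def resumeFromCheckpoint(path, data, labels, epochs):
    # Rebuild the same architecture, then restore its parameters and the optimizer state
    net = buildNet()

    optimizer = MomentumSGD()
    optimizer.setupOn(net, useGlobalState=True)

    net.load(os.path.join(path, "net.hdf"))
    optimizer.load(os.path.join(path, "optimizer.hdf"))

    train(net, optimizer, data, labels, epochs)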