Pausing and resuming network training

In this tutorial, we will learn how the PuzzleLib library lets you continue training from a saved checkpoint.

Being able to pause model training and resume it later is an important, though not technically difficult, capability. It helps you organize long training runs: you can save checkpoints, load any of them, continue training from that point and, as a result, experiment safely.

Before reading on, we strongly advise that you go through the Training the MNIST classifier tutorial, if you have not already done so. This material is based on it and does not repeat points that were already discussed there.

Training sample

Go to the website of Yann LeCun (creator of the MNIST dataset) and download the following files:

  • t10k-images.idx3-ubyte.gz
  • t10k-labels.idx1-ubyte.gz
  • train-images.idx3-ubyte.gz
  • train-labels.idx1-ubyte.gz

Unpack the downloaded files into a folder of your choice. The extracted files contain 70,000 black-and-white images, 28 by 28 pixels each, of handwritten digits from 0 to 9.
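The unpacked files are stored in the IDX binary format: a header of big-endian 32-bit integers followed by raw unsigned bytes. MnistLoader reads them for you, so the following is only a minimal sketch of the format for the curious. The helper read_idx_images is hypothetical and not part of PuzzleLib, and the example parses a tiny in-memory file so that it runs without the dataset.

```python
import io
import struct


def read_idx_images(stream):
    """Parse an IDX3 image file into a list of images (each a list of pixel rows)."""
    # Header: magic number, image count, rows, columns (big-endian uint32 each).
    magic, count, rows, cols = struct.unpack(">IIII", stream.read(16))
    if magic != 2051:  # 0x00000803: unsigned-byte data with 3 dimensions
        raise ValueError("not an IDX3 image file: magic=%d" % magic)

    images = []
    for _ in range(count):
        raw = stream.read(rows * cols)
        images.append([list(raw[r * cols:(r + 1) * cols]) for r in range(rows)])
    return images


# Fake file with two 2x3 images instead of 28x28 ones.
pixels = bytes(range(12))
fake_file = io.BytesIO(struct.pack(">IIII", 2051, 2, 2, 3) + pixels)

images = read_idx_images(fake_file)
print(len(images), images[1])
```

The label files follow the same idea with a shorter header (magic number 2049 and an item count), followed by one byte per label.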

Implementation in library tools

Let us take a look at the structure of the code. Although it is based on the Training the MNIST classifier tutorial, it has quite a few differences.

The imports are the same:

import numpy as np
import os

from PuzzleLib.Datasets import MnistLoader

from PuzzleLib.Models.Nets.LeNet import loadLeNet

from PuzzleLib.Containers import Sequential
from PuzzleLib.Modules import Conv2D, MaxPool2D, Activation, Flatten, Linear
from PuzzleLib.Modules.Activation import relu
from PuzzleLib.Handlers import Trainer, Validator
from PuzzleLib.Optimizers import MomentumSGD
from PuzzleLib.Cost import CrossEntropy

The library already implements the LeNet network; the next (commented-out) line shows how it can be loaded for further use:

#net = loadLeNet(None, initscheme=None)

For a better understanding, however, let's build all the layers ourselves. For convenience, we move the construction of the network architecture into a separate function, buildNet:

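def buildNet():
    net = Sequential()

    # First convolutional block: 1x28x28 -> 16x26x26 -> 16x13x13
    net.append(Conv2D(1, 16, 3))
    net.append(MaxPool2D())
    net.append(Activation(relu))

    # Second convolutional block: 16x13x13 -> 32x10x10 -> 32x5x5
    net.append(Conv2D(16, 32, 4))
    net.append(MaxPool2D())
    net.append(Activation(relu))

    # Flatten the feature maps before the fully connected layers
    net.append(Flatten())
    net.append(Linear(32 * 5 * 5, 1024))
    net.append(Activation(relu))

    # Output layer: one value per digit class
    net.append(Linear(1024, 10))

    return net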
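The input size of the first Linear layer, 32 * 5 * 5, follows from how each layer shrinks the 28 by 28 image. The sketch below checks that arithmetic, assuming each convolution uses stride 1 with no padding and is followed by 2x2 max pooling with stride 2. The pooling parameters are an assumption, not something stated in the text, and conv_out is a hypothetical helper.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output side length of a convolution or pooling window."""
    return (size + 2 * pad - kernel) // stride + 1


side = 28                                   # MNIST image side
side = conv_out(side, kernel=3)             # Conv2D(1, 16, 3)        -> 26
side = conv_out(side, kernel=2, stride=2)   # assumed 2x2 max pooling -> 13
side = conv_out(side, kernel=4)             # Conv2D(16, 32, 4)       -> 10
side = conv_out(side, kernel=2, stride=2)   # assumed 2x2 max pooling -> 5

features = 32 * side * side                 # 32 feature maps of 5x5
print(side, features)
```

If you change a kernel size or remove a pooling step, recompute this number and update the first Linear layer accordingly.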

The training process itself is also moved into a separate function, train:

def train(net, optimizer, data, labels, epochs):
    cost = CrossEntropy(maxlabels=10)
    trainer = Trainer(net, cost, optimizer)
    validator = Validator(net, cost)

    for i in range(epochs):
        trainer.trainFromHost(data[:60000], labels[:60000], macroBatchSize=60000,
                              onMacroBatchFinish=lambda tr: print("Train error: %s" % tr.cost.getMeanError()))
        print("Accuracy: %s" % (1.0 - validator.validateFromHost(data[60000:], labels[60000:], macroBatchSize=10000)))

        optimizer.learnRate *= 0.9
        print("Reduced optimizer learn rate to %s" % optimizer.learnRate)

Function arguments:

  • net - object that represents the network in the library (Sequential in our case);
  • optimizer - network optimizer that is an object of the Optimizer class from the Optimizers family;
  • data - data tensor of the np.ndarray format; in our case that would be a tensor of (N, C, H, W), where
      N - total number of images,
      C - channels of images (for MNIST images are b&w, i.e. single-channelled),
      H - height and W - width of images (28 and 28 respectively);
  • labels - vector of labels in np.ndarray format and N length for the corresponding images;
  • epochs - number of training epochs.
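To make the (N, C, H, W) layout and the train/validation split used in train concrete, here is a small sketch with NumPy. It uses a stand-in array of 70 blank images rather than the real 70,000, and the names n_total, n_train and flat are hypothetical.

```python
import numpy as np

# Stand-in for the real dataset: 70 images instead of 70,000.
n_total, n_train = 70, 60
flat = np.zeros((n_total, 28 * 28), dtype=np.float32)

# Add the channel axis to obtain the (N, C, H, W) layout.
data = flat.reshape(n_total, 1, 28, 28)
labels = np.arange(n_total, dtype=np.int32) % 10

# Same slicing pattern as in train: the first part is used for training,
# the remainder for validation.
train_data, train_labels = data[:n_train], labels[:n_train]
val_data, val_labels = data[n_train:], labels[n_train:]

print(train_data.shape, val_data.shape)
```

With the real dataset the split point is 60000, which leaves 10,000 images for validation.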

Load the data:

path = "../TestData/"
mnist = MnistLoader()
data, labels = mnist.load(path=path)
data, labels = data[:], labels[:]
print("Loaded mnist")



Do not forget to replace the value of the path variable with the folder where you unpacked the dataset archives.

The next step is to create the network and set up an optimizer on it:

net = buildNet()

optimizer = MomentumSGD()
optimizer.setupOn(net, useGlobalState=True)
optimizer.learnRate = 0.1
optimizer.momRate = 0.9

Now we need to execute several training epochs - we have chosen 10:

epochs = 10
print("Training for %s epochs ..." % epochs)
train(net, optimizer, data, labels, epochs)
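Because train multiplies optimizer.learnRate by 0.9 at the end of every epoch, the rate is noticeably lower by the time training is paused. The following sketch reproduces that schedule with plain numbers; the variable names are illustrative only.

```python
learn_rate = 0.1   # initial optimizer.learnRate
decay = 0.9        # factor applied at the end of every epoch in train()

schedule = []
for epoch in range(10):
    schedule.append(learn_rate)   # rate used during this epoch
    learn_rate *= decay

print("rate in first epoch:   %.4f" % schedule[0])
print("rate in last epoch:    %.4f" % schedule[-1])
print("rate after 10 epochs:  %.4f" % learn_rate)
```

After 10 epochs the rate has dropped from 0.1 to about 0.035. In this script the optimizer object stays in memory, so a second call to train continues from the reduced rate rather than from 0.1.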

After 10 epochs, the net and optimizer objects hold a specific internal state (the learned parameter values and the optimizer's own variables), which we want to save so that training can be continued from it later:

print("Saving net and optimizer ...")
net.save(os.path.join(path, "net.hdf"))
optimizer.save(os.path.join(path, "optimizer.hdf"))

Finally, once we decide to proceed with training, we simply load the parameters from the files into previously created objects of the corresponding classes:

print("Reloading net and optimizer ...")
net.load(os.path.join(path, "net.hdf"))
optimizer.load(os.path.join(path, "optimizer.hdf"))

print("Continuing training for %s epochs ..." % epochs)
train(net, optimizer, data, labels, epochs)
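Why save the optimizer as well as the network? A momentum optimizer carries its own state between steps, and dropping it changes the course of training. The toy sketch below illustrates this on a one-parameter problem. It uses the textbook momentum update, which may differ in detail from PuzzleLib's MomentumSGD, and a JSON string stands in for the net.hdf and optimizer.hdf files. All names are hypothetical.

```python
import json


def momentum_step(w, v, learn_rate=0.1, mom_rate=0.9):
    """One textbook momentum-SGD step on f(w) = (w - 3)**2."""
    grad = 2.0 * (w - 3.0)
    v = mom_rate * v - learn_rate * grad
    return w + v, v


def run(w, v, steps):
    """Apply momentum_step repeatedly and return the final (weight, velocity)."""
    for _ in range(steps):
        w, v = momentum_step(w, v)
    return w, v


# Reference: 20 uninterrupted steps.
w_ref, _ = run(0.0, 0.0, 20)

# Paused run: 10 steps, save weight and velocity, reload, 10 more steps.
w, v = run(0.0, 0.0, 10)
saved = json.dumps({"w": w, "v": v})      # stand-in for writing checkpoint files
state = json.loads(saved)                 # stand-in for loading them back
w_resumed, _ = run(state["w"], state["v"], 10)

# Resuming with the weight only (velocity reset to zero) takes a different path.
w_weights_only, _ = run(state["w"], 0.0, 10)

print(w_resumed == w_ref, w_weights_only == w_ref)
```

The resumed run matches the uninterrupted one exactly, while the run that restores only the weight does not.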

At the end of the script, the files created during its execution are removed:

os.remove(os.path.join(path, "net.hdf"))
os.remove(os.path.join(path, "optimizer.hdf"))
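If you would rather not remove checkpoint files by hand, one option is to write them into a temporary folder that is deleted automatically. This is a sketch using only the Python standard library; save_placeholder is a hypothetical stand-in for net.save and optimizer.save so that the example runs without PuzzleLib.

```python
import os
import tempfile


def save_placeholder(path):
    """Hypothetical stand-in for net.save / optimizer.save: writes a small file."""
    with open(path, "wb") as f:
        f.write(b"checkpoint")


with tempfile.TemporaryDirectory() as folder:
    net_file = os.path.join(folder, "net.hdf")
    optimizer_file = os.path.join(folder, "optimizer.hdf")

    save_placeholder(net_file)
    save_placeholder(optimizer_file)
    existed_inside = os.path.exists(net_file) and os.path.exists(optimizer_file)

# Leaving the with-block deletes the folder and everything in it.
removed_after = not os.path.exists(net_file) and not os.path.exists(optimizer_file)
print(existed_inside, removed_after)
```

This suits throwaway experiments like this tutorial. For real training runs you would keep the checkpoints in a permanent folder instead.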