Pausing and resuming network training
In this tutorial, we will learn how the PuzzleLib library allows training to be continued from a saved checkpoint.
The ability to pause model training and resume it later is an important, though not technically difficult, feature. It helps you organize long training runs: you can save checkpoints, load any of them, continue training from it and, as a result, experiment safely.
Before reading further, we strongly advise that you go through the Training the MNIST classifier tutorial, if you have not already done so, since this material is based on it and does not cover the points that were already discussed there.
Go to the website of Yann LeCun (creator of the MNIST dataset) and download the following files:
Place the downloaded files in a folder of your choice and unpack them. The extracted files contain 70,000 black-and-white images, 28 by 28 pixels each, of handwritten digits from 0 to 9.
Implementation in library tools
Let us take a look at the structure of the code. Although it is based on the Training the MNIST classifier tutorial, it differs from it in quite a few places.
The imports are the same:
import numpy as np
import os

from PuzzleLib.Datasets import MnistLoader
from PuzzleLib.Models.Nets.LeNet import loadLeNet

from PuzzleLib.Containers import Sequential
from PuzzleLib.Modules import Conv2D, MaxPool2D, Activation, Flatten, Linear
from PuzzleLib.Modules.Activation import relu

from PuzzleLib.Handlers import Trainer, Validator
from PuzzleLib.Optimizers import MomentumSGD
from PuzzleLib.Cost import CrossEntropy
#net = loadLeNet(None, initscheme=None)

def buildNet():
    net = Sequential()

    net.append(Conv2D(1, 16, 3))
    net.append(MaxPool2D())
    net.append(Activation(relu))

    net.append(Conv2D(16, 32, 4))
    net.append(MaxPool2D())
    net.append(Activation(relu))

    net.append(Flatten())
    net.append(Linear(32 * 5 * 5, 1024))
    net.append(Activation(relu))

    net.append(Linear(1024, 10))

    return net
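The input size of the first Linear layer, 32 * 5 * 5, follows from how each layer shrinks a 28 by 28 image. The sketch below traces that arithmetic in plain Python. It assumes that Conv2D uses stride 1 with no padding and that MaxPool2D uses a 2 by 2 window with stride 2; these defaults are not stated in this tutorial, but they are consistent with the 32 * 5 * 5 figure above. The helper names conv_out, pool_out and trace_sizes are hypothetical and are not part of PuzzleLib.

```python
def conv_out(size, kernel, stride=1, pad=0):
    # Output length of a convolution along one spatial axis
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    # Output length of max pooling along one spatial axis
    return (size - window) // stride + 1


def trace_sizes(size=28):
    # Follow the height (equal to the width) of an MNIST image through buildNet()
    sizes = [size]
    for layer, kernel in (("conv", 3), ("pool", 2), ("conv", 4), ("pool", 2)):
        size = conv_out(size, kernel) if layer == "conv" else pool_out(size, kernel)
        sizes.append(size)
    return sizes


sizes = trace_sizes()
flat_features = 32 * sizes[-1] * sizes[-1]  # 32 output maps of the second Conv2D

print(sizes)
print(flat_features)
```

If the kernel sizes or the input resolution change, the same arithmetic gives the new value to pass to the first Linear layer.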
The training process itself is also placed in a separate function, called train:
def train(net, optimizer, data, labels, epochs):
    cost = CrossEntropy(maxlabels=10)
    trainer = Trainer(net, cost, optimizer)
    validator = Validator(net, cost)

    for i in range(epochs):
        # Train on the first 60,000 images
        trainer.trainFromHost(
            data[:60000], labels[:60000], macroBatchSize=60000,
            onMacroBatchFinish=lambda tr: print("Train error: %s" % tr.cost.getMeanError())
        )

        # Validate on the remaining 10,000 images
        print("Accuracy: %s" % (1.0 - validator.validateFromHost(data[60000:], labels[60000:], macroBatchSize=10000)))

        # Reduce the learn rate after every epoch
        optimizer.learnRate *= 0.9
        print("Reduced optimizer learn rate to %s" % optimizer.learnRate)
The function takes the following parameters:
net - object that represents the network in the library (Sequential in our case);
optimizer - network optimizer, an object of one of the classes from the Optimizers family;
data - data tensor in np.ndarray format; in our case it is a tensor of shape (N, C, H, W), where
    N - total number of images,
    C - number of image channels (MNIST images are black-and-white, i.e. single-channel),
    H - image height and W - image width (28 and 28 respectively);
labels - vector of labels in np.ndarray format, of length N, for the corresponding images;
epochs - number of training epochs.
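The train function multiplies optimizer.learnRate by 0.9 at the end of every epoch, so the learn rate decays geometrically. A minimal sketch of that schedule in plain Python is given below. The helper name learn_rate_after is hypothetical, and the starting value of 0.1 is the one assigned to the optimizer later in this tutorial.

```python
def learn_rate_after(epochs, initial=0.1, decay=0.9):
    # Apply the same per-epoch update as train(): learnRate *= 0.9
    rate = initial
    for _ in range(epochs):
        rate *= decay
    return rate


# Learn rate at the end of the first and of the second 10-epoch run
for finished in (10, 20):
    print(finished, learn_rate_after(finished))
```

Since the script in this tutorial reuses the same optimizer object for both calls to train, the second run presumably starts from the already reduced rate (about 0.035) rather than from 0.1.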
path = "../TestData/"
mnist = MnistLoader()
data, labels = mnist.load(path=path)
data, labels = data[:], labels[:]
print("Loaded mnist")

np.random.seed(1234)
Please do not forget to replace the value of the path variable with the path to the folder where you unpacked the dataset.
The next step is to create the network and set up an optimizer on it:
net = buildNet()

optimizer = MomentumSGD()
optimizer.setupOn(net, useGlobalState=True)
optimizer.learnRate = 0.1
optimizer.momRate = 0.9
Now we run several training epochs; we have chosen 10:
epochs = 10
print("Training for %s epochs ..." % epochs)
train(net, optimizer, data, labels, epochs)
After 10 epochs, the net and optimizer objects have acquired a specific internal state (the values of their parameters), which we want to save so that it can be used later for further training:
print("Saving net and optimizer ...")
net.save(os.path.join(path, "net.hdf"))
optimizer.save(os.path.join(path, "optimizer.hdf"))
Finally, once we decide to proceed with training, we simply load the parameters from the files into the previously created objects of the corresponding classes:
print("Reloading net and optimizer ...")
net.load(os.path.join(path, "net.hdf"))
optimizer.load(os.path.join(path, "optimizer.hdf"))

print("Continuing training for %s epochs ..." % epochs)
train(net, optimizer, data, labels, epochs)
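To see why the optimizer is saved together with the network, consider a toy example in plain Python. It is only a sketch: it uses the classical momentum update on a one-dimensional quadratic, which is an assumption, since this tutorial does not spell out the exact formula used by MomentumSGD. The names momentum_step, run, lr and mom are hypothetical.

```python
def momentum_step(w, v, lr=0.1, mom=0.9):
    # One classical momentum step on f(w) = 0.5 * w ** 2, whose gradient is w
    grad = w
    v = mom * v - lr * grad
    return w + v, v


def run(w, v, steps):
    # Apply several momentum steps starting from weight w and velocity v
    for _ in range(steps):
        w, v = momentum_step(w, v)
    return w, v


# Uninterrupted run: 20 steps in a row
w_full, v_full = run(1.0, 0.0, 20)

# Paused run: 10 steps, keep both weight and velocity, then 10 more steps
w_ckpt, v_ckpt = run(1.0, 0.0, 10)
w_resumed, v_resumed = run(w_ckpt, v_ckpt, 10)

# Paused run where only the weight is restored and the velocity starts from zero
w_reset, v_reset = run(w_ckpt, 0.0, 10)

print("resumed matches uninterrupted run:", w_resumed == w_full)
print("reset velocity matches uninterrupted run:", w_reset == w_full)
```

Restoring both values reproduces the uninterrupted run, while restoring only the weight takes training along a different trajectory. In the toy example the velocity plays the role of the optimizer's internal state.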
At the end of the script, the files created during its execution are removed:
os.remove(os.path.join(path, "net.hdf"))
os.remove(os.path.join(path, "optimizer.hdf"))