Speech recognition with Wav2Letter

In this tutorial, we will learn how to use the PuzzleLib library to build a speech recognition system. We will train the Wav2Letter neural network, which is already implemented in PuzzleLib, on the open LibriSpeech dataset.

Data preparation

In order to start model training, you need to prepare special csv files (manifests) for training, validation, and test data. A line in the manifest file contains the path to the audio file and its transcript.

The script PrepareLibrispeech.py downloads the LibriSpeech dataset, converts the audio files from flac to wav, and creates 6 manifest files: two each (clean and noisier data) for training, validation, and testing. The lines of each manifest file are sorted by the length of the corresponding audio recording, which helps to train the model more efficiently. The script takes two arguments: dataDir, the full path for saving the data, and manifestDir, the path for saving the manifest files.
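As an illustration of the manifest layout described above, the following minimal sketch writes a manifest sorted by audio duration. The helper name writeManifest, the file paths, and the (path, transcript, duration) tuples are assumptions made for this example; PrepareLibrispeech.py may organize this step differently.

```python
import csv
import os
import tempfile


def writeManifest(entries, manifestPath):
    """Write 'audioPath,transcript' lines, shortest recording first.

    entries is a list of (audioPath, transcript, durationSeconds) tuples.
    This is a hypothetical helper, not part of PrepareLibrispeech.py.
    """
    ordered = sorted(entries, key=lambda entry: entry[2])
    with open(manifestPath, "w", newline="") as f:
        writer = csv.writer(f)
        for audioPath, transcript, _ in ordered:
            writer.writerow([audioPath, transcript])
    return [entry[0] for entry in ordered]


# Made-up paths and durations, used only to show the ordering.
entries = [
    ("wav/long.wav", "a longer utterance", 12.4),
    ("wav/short.wav", "hi", 1.3),
    ("wav/medium.wav", "hello world", 4.8),
]
manifestPath = os.path.join(tempfile.mkdtemp(), "train-example.csv")
order = writeManifest(entries, manifestPath)

with open(manifestPath) as f:
    manifestLines = [line.strip() for line in f]
```

Because the dataset class shown later splits each manifest line on commas, this sketch assumes that transcripts contain no commas.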

Extracting features

There are different ways to extract features from audio data. We use normalized spectrograms, since they describe the data in considerable detail. Hyperparameters, as well as the token dictionary, are set in the LibriConfig class of the Config.py file. The labels token dictionary consists of the Latin alphabet, an apostrophe, a space, and a dummy (blank) character that indicates that no letter is emitted.

class LibriConfig:
    sampleRate = 16000
    windowSize = 0.02
    windowStride = 0.01
    window = 'hamming'
    labels = "_'abcdefghijklmnopqrstuvwxyz "

The transformations applied to the audio are implemented in the preprocess function of the Data/dataLoader.py script. The audio is read at the sampling rate sampleRate, and then a short-time Fourier transform is applied to the signal with the window size windowSize, step windowStride, and window function type window. The magnitude of the resulting complex-valued spectrogram is extracted. These operations are implemented in the librosa library.

def preprocess(audioPath, sampleRate, windowSize, windowStride, window):
    y = loadAudio(audioPath)

    nFft = int(sampleRate * windowSize)
    winLength = nFft
    hopLength = int(sampleRate * windowStride)

    D = librosa.stft(y, n_fft=nFft, hop_length=hopLength, win_length=winLength, window=window)

    spect, phase = librosa.magphase(D)
Finally, per-channel energy normalization, implemented in the pcen function, is applied to the spectrogram. This is a popular audio normalization technique in speech recognition that reduces background noise and emphasizes foreground sound.
    pcenResult = pcen(E=spect, sr=sampleRate, hopLength=hopLength)
    return pcenResult
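For intuition about what per-channel energy normalization does, here is a minimal single-channel sketch of the commonly published PCEN formulation: the energy is divided by a smoothed version of itself and then compressed. The function name pcenChannel and the parameter values (s, alpha, delta, r, eps) are assumptions chosen for illustration; the pcen function used in this tutorial may differ in its defaults and implementation.

```python
def pcenChannel(energy, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Apply PCEN to one frequency channel (a list of non-negative energies).

    Illustrative parameter values; not taken from the tutorial's pcen function.
    """
    smoothed = energy[0]  # initialize the smoother with the first frame
    result = []
    for value in energy:
        # First-order IIR low-pass filter tracks the slowly varying loudness.
        smoothed = (1 - s) * smoothed + s * value
        # Adaptive gain control followed by root compression.
        gained = value / (eps + smoothed) ** alpha
        result.append((gained + delta) ** r - delta ** r)
    return result


quiet = pcenChannel([1.0] * 50)           # stationary, low energy
loud = pcenChannel([100.0] * 50)          # stationary, 100 times louder
burst = pcenChannel([1.0] * 49 + [100.0]) # sudden onset in the last frame
```

In this sketch a stationary signal produces almost the same output regardless of its loudness, while a sudden burst stands out clearly, which matches the noise-suppressing behavior described above.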
The SpectrogramDataset class is initialized by the manifest file and config. It describes an audio dataset with transcripts, retrieves audio file paths and transcriptions from the manifest file, creates a labelsMap token dictionary and sets audio data processing hyperparameters.
class SpectrogramDataset(object):
    def __init__(self, manifestFilePath, config):
        self.manifestFilePath = manifestFilePath
        with open(self.manifestFilePath) as f:
            ids = f.readlines()
        self.ids = [x.strip().split(',') for x in ids]
        self.size = len(ids)
        self.labelsMap = dict([(config.labels[i], i) for i in range(len(config.labels))])
        self.sampleRate = config.sampleRate
        self.windowSize = config.windowSize
        self.windowStride = config.windowStride
        self.window = config.window
The __getitem__ method returns the spectrogram calculated via the preprocess function, the transcript translated into a sequence of token indexes, the path to the audio file, and the text transcript itself.
    def __getitem__(self, index):
        sample = self.ids[index]
        audioPath, transcriptLoaded = sample[0], sample[1]
        spect = preprocess(audioPath, self.sampleRate, self.windowSize, self.windowStride, self.window)

        transcript = list(filter(None, [self.labelsMap.get(x) for x in list(transcriptLoaded)]))
        return spect, transcript, audioPath, transcriptLoaded
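The transcript encoding step can be tried in isolation. The sketch below reuses the labels string from LibriConfig; the function name encodeTranscript is an assumption introduced for this example.

```python
labels = "_'abcdefghijklmnopqrstuvwxyz "
labelsMap = {char: index for index, char in enumerate(labels)}


def encodeTranscript(text):
    """Mirror the mapping in __getitem__: look up each character, then drop
    every entry that filter(None, ...) treats as falsy."""
    return list(filter(None, [labelsMap.get(char) for char in text]))


encoded = encodeTranscript("hello world")
withComma = encodeTranscript("hello, world")  # the comma is not in labels
upperCase = encodeTranscript("HELLO")         # neither are capital letters
```

Note that filter(None, ...) removes both None (characters missing from the dictionary) and index 0, which is the dummy character and should not occur in transcripts. Since the dictionary contains only lowercase letters, transcripts are expected to be lowercase.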
The BucketingSampler class is implemented to aggregate data into batches. It shuffles the data indexes, if necessary, and groups them into bins. The size of a bin corresponds to the batch size.
class BucketingSampler(object):
    def __init__(self, dataSource, batchSize=1, shuffle=False):    
        self.dataSource = dataSource
        self.batchSize = batchSize
        self.ids = list(range(0, len(dataSource)))
        self.shuffle = shuffle
        self.reset()

    ...

    def getBins(self):
        if self.shuffle:
            np.random.shuffle(self.ids)
        self.bins = [self.ids[i:i + self.batchSize] for i in range(0, len(self.ids), self.batchSize)]


    def reset(self):
        self.getBins()
        self.batchId = 0
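The grouping performed by getBins can be sketched on its own. The function name makeBins is an assumption for this example; the slicing expression is the same as in the class above.

```python
def makeBins(ids, batchSize):
    """Split a list of indexes into consecutive groups of batchSize items."""
    return [ids[i:i + batchSize] for i in range(0, len(ids), batchSize)]


bins = makeBins(list(range(10)), batchSize=4)
```

The last bin may be smaller than batchSize. Without shuffling, the manifest order is preserved, so each bin holds recordings of similar length.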
The DataLoader class uses these index groups to generate batches iteratively.
class DataLoader(object):        
    def __init__(self, dataset, batchSampler):
        self.dataset = dataset
        self.batchSampler = batchSampler
        self.sampleIter = iter(self.batchSampler)


    def __next__(self):
        try:
            indices = next(self.sampleIter)
            indices = [i for i in indices][0]
            batch = getBatch([self.dataset[i] for i in indices])        
            return batch
        except:
            raise StopIteration()
The getBatch function is used to generate batches. The features extracted from the audio signals are padded to the maximum length in the batch using wrap padding and are aggregated into a three-dimensional inputs tensor. This is why the audio in the manifest files is sorted by length: batching signals of similar length keeps padding to a minimum. The relative lengths of the signals are recorded in inputPercentages.
        seqLength = tensor.shape[1]
        tensorNew = np.pad(tensor, ((0, 0), (0, abs(seqLength-maxSeqlength))), 'wrap')
        inputs[x] = tensorNew
        inputPercentages[x] = seqLength / float(maxSeqlength)
Sequences of transcript token indexes are concatenated into a single targets array, and the lengths of the original sequences are stored in targetSizes. Together with the audio file paths and the transcripts, these variables form a batch.
        targetSizes[x] = len(target)
        targets.extend(target)
        inputFilePathAndTranscription.append([tensorPath, orignalTranscription])

    targets = np.array(targets)
    return inputs, inputPercentages, targets, targetSizes, inputFilePathAndTranscription
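The batching logic described above can be sketched in pure Python. To keep the example short, each sample has a single feature row rather than a full spectrogram; the names wrapPad and collate are assumptions and do not appear in Data/dataLoader.py.

```python
def wrapPad(row, targetLength):
    """Right-pad a sequence by repeating it from the start, which is what
    np.pad(..., 'wrap') does when padding only at the end."""
    return [row[i % len(row)] for i in range(targetLength)]


def collate(samples):
    """samples is a list of (featureRow, targetIndexes) pairs.

    A single feature row per sample keeps the sketch short; real
    spectrograms have one such row per frequency bin.
    """
    maxLength = max(len(features) for features, _ in samples)
    inputs, inputPercentages, targets, targetSizes = [], [], [], []
    for features, target in samples:
        inputs.append(wrapPad(features, maxLength))
        inputPercentages.append(len(features) / float(maxLength))
        targets.extend(target)           # one flat list for the whole batch
        targetSizes.append(len(target))  # needed to split it again later
    return inputs, inputPercentages, targets, targetSizes


inputs, inputPercentages, targets, targetSizes = collate([
    ([1, 2, 3], [9, 10]),
    ([4, 5, 6, 7, 8], [13, 16, 13]),
])
```

Here the shorter sample is extended with its own first values, and its relative length of 0.6 is what later allows the length of its prediction to be estimated.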

Speech recognition system

The speech recognition system consists of two modules: an acoustic model and a decoder. The acoustic model establishes a relationship between the audio signal and the probability distribution of the tokens in it. Here it is a convolutional neural network that receives a sequence of features extracted from an audio signal and outputs a sequence of token probability vectors. In other words, the model predicts token probabilities for each fixed time window. Taking the most probable token at each step of the acoustic model output gives a sequence like this: ______________h_eee__l_l__ooooo_ _w__oor__lll_dd__. Decoding converts the sequence of token probability vectors into the text hello world.

Acoustic model training and validation on the corresponding manifest files are implemented in the Train.py script. The script arguments are:

  • paths to the training and validation manifest files; if None is received, training or validation is skipped
    parser.add_argument('--trainManifest', metavar='DIR',
                        help='path to train manifest csv', default='Data/train-clean.csv')
    parser.add_argument('--valManifests', metavar='DIR',
                        help='path to validation manifest csv', default=['Data/dev-clean.csv'])
    
  • batch size, number of training epochs, learning rate
    parser.add_argument('--batchSize', default=8, type=int, help='Batch size for training')
    parser.add_argument('--epochs', default=100, type=int, help='Number of training epochs')
    parser.add_argument('--lr', '--learningRate', default=1e-5, type=float, help='initial learning rate')
    
  • arguments that define how often the acoustic model weights are saved, the prefix of the weight file names, and the path to the directory where the weights are saved
    parser.add_argument('--checkpointPerBatch', default=3000, type=int, help='Save checkpoint per batch. 0 means never save')
    parser.add_argument('--checkpointName', default='w2l', type=str, help='Name of checkpoints')
    parser.add_argument('--saveFolder', default='Checkpoints/', help='Location to save epoch models')
    
  • path to the acoustic model weights to initialize it with
    parser.add_argument('--continueFrom', default=None, help='Continue from checkpoint model')
    
    The training and validation data loaders, trainLoader and valLoaders, are initialized with the previously described data handling classes via the getDataLoader function.
    from Data.dataLoader import DataLoader, SpectrogramDataset, BucketingSampler
    ...
    def getDataLoader(manifestFilePath, config, batchSize):
        dataset = SpectrogramDataset(manifestFilePath=manifestFilePath, config=config)
        sampler = BucketingSampler(dataset, batchSize=batchSize)
        return DataLoader(dataset, batchSampler=sampler)
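To see how these arguments behave, the following self-contained sketch rebuilds a subset of the parser and parses two command lines. The nargs='*' setting for --valManifests and the Data/dev-other.csv path are assumptions made for this example; Train.py may define them differently.

```python
import argparse

parser = argparse.ArgumentParser(description="Wav2Letter training (sketch)")
parser.add_argument("--trainManifest", default="Data/train-clean.csv")
# nargs='*' is an assumption made here so that several paths can be passed.
parser.add_argument("--valManifests", nargs="*", default=["Data/dev-clean.csv"])
parser.add_argument("--batchSize", default=8, type=int)
parser.add_argument("--epochs", default=100, type=int)
parser.add_argument("--lr", "--learningRate", default=1e-5, type=float)
parser.add_argument("--continueFrom", default=None)

# No command-line arguments: every option falls back to its default.
defaults = parser.parse_args([])

# Hypothetical command line overriding some of the defaults.
custom = parser.parse_args(
    ["--batchSize", "16", "--learningRate", "1e-4", "--valManifests",
     "Data/dev-clean.csv", "Data/dev-other.csv"]
)
```

Because --lr is the first long option name, argparse stores the learning rate as args.lr even when it is passed as --learningRate.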
    

Training the acoustic model

The acoustic model architecture we use is inspired by Wav2Letter. The advantage of this acoustic model is that it consists entirely of convolutional layers, which makes computation more efficient. The Wav2Letter network is already available in PuzzleLib, so you only need to import it from the library.

from PuzzleLib.Models.Nets.WaveToLetter import loadW2L
The dimension of the neural network input is determined by the dimension of the extracted features, in particular by the size of the window used to compute the short-time Fourier transform of the audio signal. The dimension of the output corresponds to the number of tokens. If the continueFrom argument is given, the model is initialized with the saved weights.
    nfft = int(config.sampleRate * config.windowSize)
    w2l = loadW2L(modelpath=args.continueFrom, inmaps=(1 + nfft // 2), nlabels=len(config.labels))
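The arithmetic behind inmaps and nlabels can be checked with the values from LibriConfig. The function name featureShape is an assumption for this sketch, and the frame count assumes that librosa.stft runs with its default center=True.

```python
def featureShape(numSamples, sampleRate=16000, windowSize=0.02, windowStride=0.01):
    """Return (frequencyBins, frames) of the spectrogram for one recording.

    Assumes librosa.stft with its default center=True, which yields
    1 + numSamples // hopLength frames.
    """
    nFft = int(sampleRate * windowSize)          # samples per window
    hopLength = int(sampleRate * windowStride)   # samples between windows
    frequencyBins = 1 + nFft // 2                # one-sided spectrum
    frames = 1 + numSamples // hopLength
    return frequencyBins, frames


labels = "_'abcdefghijklmnopqrstuvwxyz "
inmaps, frames = featureShape(numSamples=16000)  # one second of audio
nlabels = len(labels)
```

With a 20 ms window at 16 kHz, each window covers 320 samples, so the network receives 161 frequency bins per frame and predicts 29 token probabilities per output step.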
Most speech recognition datasets contain only audio files and their transcripts, but do not contain information about which time segments correspond to which transcription tokens. Moreover, several token sequences can correspond to the same transcript. For example, look at the token sequences for "hello world":
______________h_e_l_l_ooo__ _wo__rr__llll__d__
______________hhh_eel_lll_oo__ __w_o_r_l_ddd__
_______hh__eeee_lll_l_oooo__ _www_ooo_r_ll_d__
This is why Connectionist Temporal Classification (CTC) is used to train the neural network. CTC maximizes the total probability of all token sequences of a given length that correspond to a given transcript. The CTC criterion is also implemented in PuzzleLib. To initialize it, you only need to pass the index of the dummy token.
from PuzzleLib.Cost.CTC import CTC

...

    blankIndex = [i for i in range(len(config.labels)) if config.labels[i] == '_'][0]
    ctc = CTC(blankIndex)
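To make the idea behind CTC concrete, the following sketch enumerates every token sequence of a given length that collapses to a target and sums their probabilities by brute force. It illustrates the quantity that CTC maximizes and is not how PuzzleLib computes it; the function names and the three-token alphabet are assumptions for this example.

```python
import itertools
import math


def collapse(path, blank="_"):
    """Merge repeated tokens, then remove blanks (the CTC collapsing rule)."""
    merged = [token for i, token in enumerate(path) if i == 0 or token != path[i - 1]]
    return "".join(token for token in merged if token != blank)


def alignments(target, alphabet, length):
    """Enumerate every token path of the given length that collapses to target."""
    return ["".join(path) for path in itertools.product(alphabet, repeat=length)
            if collapse(path) == target]


def bruteForceLikelihood(probs, target, alphabet):
    """Sum the probabilities of all valid paths; probs[t][k] is the
    probability of alphabet[k] at time step t."""
    total = 0.0
    for path in alignments(target, alphabet, len(probs)):
        total += math.prod(probs[t][alphabet.index(token)] for t, token in enumerate(path))
    return total


alphabet = "_ab"
paths = alignments("ab", alphabet, length=3)
uniform = [[1 / 3] * 3 for _ in range(3)]
likelihood = bruteForceLikelihood(uniform, "ab", alphabet)
```

Five of the 27 possible paths of length 3 collapse to "ab". The CTC loss is the negative logarithm of this sum; practical implementations compute it with dynamic programming, since enumeration grows exponentially with the sequence length.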
We use Adam as an optimization method.
from PuzzleLib.Optimizers.Adam import Adam

...

    adam = Adam(alpha=args.lr)
    adam.setupOn(w2l, useGlobalState=True)
The train function is called to train the model. The model is put into training mode, and the data loader then generates batches iteratively. Each batch is transferred to the GPU, and the model is applied to it.
def train(model, ctc, optimizer, loader, checkpointPerBatch, saveFolder, saveName):
    model.reset()
    model.trainMode()
    loader.reset()

    for i, (data) in enumerate(loader):
        inputs, inputPercentages, targets, targetSizes, _ = data
        gpuInput = gpuarray.to_gpu(inputs.astype(np.float32))
        out = model(gpuInput)
The lengths of the predicted sequences, outlen, are estimated from the relative lengths of the input features.
        outlen = gpuarray.to_gpu((out.shape[0] * inputPercentages).astype(np.int32))
The CTC loss and its gradient are calculated from the model predictions out, the prediction lengths outlen, the target token indexes targets, and their lengths targetSizes.
        error, grad = ctc([out, outlen], [targets, targetSizes])
The model parameters are updated using the backpropagation method.

        optimizer.zeroGradParams()
        model.backward(grad, updGrad=False)
        optimizer.update()
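The output-length estimate used inside the training loop is plain arithmetic and can be illustrated without a GPU. The function name estimateOutputLengths and the numbers are assumptions for this sketch.

```python
def estimateOutputLengths(outSteps, inputPercentages):
    """Scale the number of output time steps by each sample's relative
    input length and truncate to an integer, as astype(np.int32) does."""
    return [int(outSteps * fraction) for fraction in inputPercentages]


outlen = estimateOutputLengths(outSteps=100, inputPercentages=[1.0, 0.75, 0.5])
```

A sample that fills only half of the padded input is thus assumed to occupy half of the output time steps, which tells the loss how many predictions belong to real audio rather than padding.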

Decoding and validation

The decoder receives the output of the acoustic model as a tensor of shape (N_{batch}, T, N_{tokens}). The task is to obtain text from this sequence of token probability vectors. We will look at the simplest approach, a greedy decoder. It is implemented in the Decoder.py file. To initialize it, you need to pass the sequence of tokens and the index of the dummy symbol.

from Decoder import GreedyDecoder

...

    decoder = GreedyDecoder(config.labels, blankIndex)
The prediction text is constructed in the decode method. At each time step, the decoder takes the token with the maximum probability. This may yield, for example, a sequence of token indexes corresponding to the token sequence _______hh__eeee_lll_l_oooo__ _www_ooo_r_ll_d__.
    def decode(self, probs, sizes=None):
        npMaxProbs = np.argmax(probs, 2)
The processString method processes this sequence of token indexes. It traverses the sequence and appends a character char to the prediction string if it is not the dummy character and does not match the previous character in the sequence. As a result, we get the text hello world.
    def processString(self, sequence, size):
        string = ''
        for i in range(size):
            char = self.intToChar[sequence[i].item()]
            if char != self.intToChar[self.blankIndex]:
                if i != 0 and char == self.intToChar[sequence[i - 1].item()]:
                    pass
                elif char == self.labels[self.spaceIndex]:
                    string += ' '
                else:
                    string = string + char
        return string  
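The two decoding steps can be reproduced on plain Python lists. This is a minimal sketch of greedy decoding under the rules described above; the function names greedyDecode and collapseIndexes and the tiny probability matrix are assumptions and do not appear in Decoder.py.

```python
def greedyDecode(probs, labels, blankIndex):
    """probs is a list of per-time-step probability vectors for one utterance."""
    # Step 1: pick the most probable token index at every time step.
    best = [max(range(len(step)), key=step.__getitem__) for step in probs]
    return collapseIndexes(best, labels, blankIndex)


def collapseIndexes(indexes, labels, blankIndex):
    """Step 2: skip blanks and tokens that repeat the previous time step."""
    text = ""
    for i, index in enumerate(indexes):
        if index == blankIndex:
            continue
        if i > 0 and index == indexes[i - 1]:
            continue
        text += labels[index]
    return text


labels = "_'abcdefghijklmnopqrstuvwxyz "
path = "_______hh__eeee_lll_l_oooo__ _www_ooo_r_ll_d__"
decoded = collapseIndexes([labels.index(c) for c in path], labels, blankIndex=0)

tinyLabels = "_ab"
tinyProbs = [[0.1, 0.8, 0.1], [0.1, 0.7, 0.2], [0.9, 0.05, 0.05],
             [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]]
tinyDecoded = greedyDecode(tinyProbs, tinyLabels, blankIndex=0)
```

The blank between the two groups of "a" in the small example is what allows a doubled letter to survive the collapsing step, just as in the double "l" of hello.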
The Word Error Rate (WER) and Character Error Rate (CER) metrics compare the predicted text with the original transcript. Both are based on the Levenshtein distance, which measures the difference between two sequences: it is the minimum number of single-element operations (insertion, deletion, and replacement) required to turn one sequence into the other. The distance is implemented in the Levenshtein library. The calcWer function calculates the WER between texts s1 and s2. Each distinct word is mapped to a single character so that the texts become sequences of words; the Levenshtein distance between the two sequences is then calculated and normalized by the length of the second sequence, that is, by the number of words in s2.
def calcWer(s1, s2):
    b = set(s1.split() + s2.split())
    word2char = dict(zip(b, range(len(b))))
    w1 = [chr(word2char[w]) for w in s1.split()]
    w2 = [chr(word2char[w]) for w in s2.split()]
    return Levenshtein.distance(''.join(w1), ''.join(w2)) / len(''.join(w2))
The calcCer function is implemented for calculating CER. Texts are represented as sequences of characters. Then the Levenshtein distance between them is calculated and normalized by the length of the second sequence.
def calcCer(s1, s2):
    s1, s2 = s1.replace(' ', ''), s2.replace(' ', '')
    return Levenshtein.distance(s1, s2) / len(s2)
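For readers without the Levenshtein library, both metrics can be sketched with a small dynamic-programming edit distance that works on lists of words as well as on strings. The function names editDistance, wordErrorRate, and charErrorRate are assumptions for this example; the normalization follows calcWer and calcCer above.

```python
def editDistance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(b) + 1))
    for i, itemA in enumerate(a, start=1):
        current = [i]
        for j, itemB in enumerate(b, start=1):
            substitution = previous[j - 1] + (itemA != itemB)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def wordErrorRate(prediction, reference):
    """Word-level distance divided by the number of reference words."""
    return editDistance(prediction.split(), reference.split()) / len(reference.split())


def charErrorRate(prediction, reference):
    """Character-level distance with spaces removed, as in calcCer."""
    prediction, reference = prediction.replace(" ", ""), reference.replace(" ", "")
    return editDistance(prediction, reference) / len(reference)


wer = wordErrorRate("hello word", "hello world")
cer = charErrorRate("hello word", "hello world")
```

One wrong word out of two gives a WER of 0.5, while the single missing letter out of ten reference characters gives a CER of 0.1. Both functions fail on an empty reference, which is consistent with the try/except around the metric calls in the validation loop.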
Validation is implemented in the validate function. The model is switched to inference mode, and the totalCer and totalWer quality metrics are initialized with zeros.
def validate(model, loader, decoder, logPath):
    loader.reset()
    model.evalMode()
    totalCer, totalWer = 0, 0
As in training, the model is applied to iteratively generated batches, and the lengths of the predicted sequences are estimated.
    for i, (data) in enumerate(loader):
        inputs, inputPercentages, targets, targetSizes, inputFile = data
        gpuInput = gpuarray.to_gpu(inputs.astype(np.float32))
        out = model(gpuInput)
        outlen = (out.shape[0] * inputPercentages).astype(np.int32)
The decode method is used to get text predictions.
        decodedOutput = decoder.decode(np.moveaxis(out.get(), 0, 1), outlen)
wer and cer are calculated for each prediction in the batch and summed over the batch. The batch sums are then added to the overall totalCer and totalWer.
        wer, cer = 0, 0
        for x in range(len(decodedOutput)):            
            transcript, reference = decodedOutput[x], inputFile[x][1]
            print ('transcript: {}\nreference: {}\nfilepath: {}'.format(transcript, reference, inputFile[x][0]))
            try:
                wer += calcWer(transcript, reference)
                cer += calcCer(transcript, reference)
            except Exception as e:
                print ('encountered exception {}'.format(e))
        totalCer += cer
        totalWer += wer
After the loop over batches completes, the overall wer and cer are obtained by dividing the totals by the size of the validation dataset and converting the result to percentages.
    wer = totalWer / len(loader.dataset) * 100
    cer = totalCer / len(loader.dataset) * 100

Results

After 42 epochs of training on the clean training data, we achieved the following quality on the clean LibriSpeech validation and test data:

Dataset       WER, %    CER, %
dev-clean     17.526    6.611
test-clean    16.495    6.131