Speech recognition with Wav2Letter¶
In this tutorial, we will learn how to use the PuzzleLib library to build a speech recognition system. We will train the Wav2Letter neural network, which is already implemented in PuzzleLib, on the open LibriSpeech dataset.
Data preparation¶
Before training the model, you need to prepare special csv files (manifests) for the training, validation, and test data. Each line of a manifest file contains the path to an audio file and its transcript.
The PrepareLibrispeech.py script downloads the LibriSpeech dataset, converts the audio files from flac to wav, and creates 6 manifest files: two each (for clean and for noisier data) for training, validation, and testing. The lines of the manifest files are sorted by the length of the corresponding audio recordings, which helps to train the model more efficiently. The script arguments are dataDir, the full path where the data is saved, and manifestDir, the path where the manifest files are saved.
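For illustration, a manifest is a plain csv file in which every line holds an audio path and its lowercase transcript, separated by a comma (the paths and transcripts below are made up):
Data/LibriSpeech/wav/example-0001.wav,hello world
Data/LibriSpeech/wav/example-0002.wav,speech recognition with puzzlelib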
Extracting features¶
There are different ways to extract features from audio data. We use normalized spectrograms, since they describe the data in considerable detail. The hyperparameters, as well as the token dictionary, are set in the LibriConfig class of the Config.py file. The labels token dictionary consists of the Latin alphabet, an apostrophe, a space, and a blank (dummy) character that stands for the omission of a letter.
class LibriConfig:
    sampleRate = 16000   # audio sampling rate, Hz
    windowSize = 0.02    # STFT window size, seconds
    windowStride = 0.01  # STFT window step, seconds
    window = 'hamming'   # STFT window function
    labels = "_'abcdefghijklmnopqrstuvwxyz "  # token dictionary: blank, apostrophe, letters, space
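With these defaults, the spectrogram dimensions used later in the tutorial work out as follows (a quick sanity check, not part of the library code):
nFft = int(16000 * 0.02)       # 320 samples per analysis window
hopLength = int(16000 * 0.01)  # 160 samples between neighbouring windows
nFeatures = 1 + nFft // 2      # 161 frequency bins per spectrogram frame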
The transformations applied to the audio are implemented in the preprocess function of the Data/dataLoader.py script. The audio is read at the specified sampling rate sampleRate, then the short-time Fourier transform is applied to the signal with the specified window size windowSize, step windowStride, and window function type window. The magnitude of the resulting complex-valued spectrogram is extracted. These operations are implemented with the librosa library.
def preprocess(audioPath, sampleRate, windowSize, windowStride, window):
y = loadAudio(audioPath)
nFft = int(sampleRate * windowSize)
winLength = nFft
hopLength = int(sampleRate * windowStride)
D = librosa.stft(y, n_fft=nFft, hop_length=hopLength, win_length=winLength, window=window)
spect, phase = librosa.magphase(D)
The pcen function (per-channel energy normalization) is then applied to the spectrogram. This is a popular audio normalization technique in speech recognition: it reduces background noise and emphasizes the foreground sound.
pcenResult = pcen(E=spect, sr=sampleRate, hopLength=hopLength)
return pcenResult
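A minimal usage sketch of preprocess, assuming the default config values and a local wav file (the path is hypothetical):
config = LibriConfig()

# normalized spectrogram: rows are frequency bins, columns are time frames
spect = preprocess('Data/LibriSpeech/example.wav', config.sampleRate,
                   config.windowSize, config.windowStride, config.window)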
The SpectrogramDataset class is initialized with a manifest file and a config. It describes an audio dataset with transcripts: it reads audio file paths and transcriptions from the manifest file, builds the labelsMap token dictionary, and stores the audio processing hyperparameters.
class SpectrogramDataset(object):
def __init__(self, manifestFilePath, config):
self.manifestFilePath = manifestFilePath
with open(self.manifestFilePath) as f:
ids = f.readlines()
self.ids = [x.strip().split(',') for x in ids]
self.size = len(ids)
self.labelsMap = dict([(config.labels[i], i) for i in range(len(config.labels))])
self.sampleRate = config.sampleRate
self.windowSize = config.windowSize
self.windowStride = config.windowStride
self.window = config.window
The __getitem__ method returns the spectrogram computed by the preprocess function, the transcript translated into a sequence of token indices, the path to the audio file, and the original text transcript.
def __getitem__(self, index):
sample = self.ids[index]
audioPath, transcriptLoaded = sample[0], sample[1]
spect = preprocess(audioPath, self.sampleRate, self.windowSize, self.windowStride, self.window)
transcript = list(filter(None, [self.labelsMap.get(x) for x in list(transcriptLoaded)]))
return spect, transcript, audioPath, transcriptLoaded
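For illustration, the dataset can be indexed directly once a manifest exists (a minimal sketch using the default train manifest path):
config = LibriConfig()
dataset = SpectrogramDataset(manifestFilePath='Data/train-clean.csv', config=config)

# spectrogram, token index sequence, audio path and raw transcript of the first recording
spect, transcript, audioPath, text = dataset[0]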
The BucketingSampler class aggregates the data into batches. It shuffles the data indices, if necessary, and groups them into bins; the size of a bin corresponds to the batch size.
class BucketingSampler(object):
def __init__(self, dataSource, batchSize=1, shuffle=False):
self.dataSource = dataSource
self.batchSize = batchSize
self.ids = list(range(0, len(dataSource)))
self.shuffle = shuffle
self.reset()
...
def getBins(self):
if self.shuffle:
np.random.shuffle(self.ids)
self.bins = [self.ids[i:i + self.batchSize] for i in range(0, len(self.ids), self.batchSize)]
def reset(self):
self.getBins()
self.batchId = 0
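For example, with batchSize=8 the sampler splits the shuffled dataset indices into consecutive groups of eight (a usage sketch):
sampler = BucketingSampler(dataset, batchSize=8, shuffle=True)
print(sampler.bins[0])  # the list of 8 dataset indices that form the first batch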
The DataLoader class uses these index groups to generate batches iteratively.
class DataLoader(object):
def __init__(self, dataset, batchSampler):
self.dataset = dataset
self.batchSampler = batchSampler
self.sampleIter = iter(self.batchSampler)
def __next__(self):
try:
indices = next(self.sampleIter)
indices = [i for i in indices][0]
batch = getBatch([self.dataset[i] for i in indices])
return batch
except:
raise StopIteration()
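Batches can then be drawn from the loader in a plain loop; the unpacking below mirrors what the train function does later:
loader = DataLoader(dataset, batchSampler=sampler)

for inputs, inputPercentages, targets, targetSizes, fileInfo in loader:
    print(inputs.shape)  # (batch size, frequency bins, longest sequence length in the batch)
    break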
The getBatch function assembles the batches. The features extracted from the audio signals are padded ('wrap' padding) to the maximum length in the batch and stacked into a three-dimensional inputs tensor. This is why we sorted the audio in the manifest files by length: grouping signals of similar length into one batch keeps the padding small. The relative lengths of the signals are stored in inputPercentages.
seqLength = tensor.shape[1]
tensorNew = np.pad(tensor, ((0, 0), (0, abs(seqLength-maxSeqlength))), 'wrap')
inputs[x] = tensorNew
inputPercentages[x] = seqLength / float(maxSeqlength)
The token sequences of the transcripts are concatenated into targets, and the lengths of the original sequences are stored in targetSizes. Together with the paths to the audio files and the transcripts, these variables form the batch.
targetSizes[x] = len(target)
targets.extend(target)
inputFilePathAndTranscription.append([tensorPath, orignalTranscription])
targets = np.array(targets)
return inputs, inputPercentages, targets, targetSizes, inputFilePathAndTranscription
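To see what the 'wrap' padding does, here is a tiny standalone example: the end of a short signal is filled by repeating it from the beginning until the required length is reached.
import numpy as np

a = np.array([[1, 2, 3]])
print(np.pad(a, ((0, 0), (0, 4)), 'wrap'))  # [[1 2 3 1 2 3 1]]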
Speech recognition system¶
The speech recognition system consists of two modules: an acoustic model and a decoder. The acoustic model establishes a relationship between the audio signal and the probability distribution of tokens in it. Here it is a convolutional neural network that receives a sequence of features extracted from the audio signal and outputs a sequence of token probability vectors, which means the model predicts token probabilities for each fixed time window. If you simply take the most probable token from each output step of the acoustic model, you get something like this: ______________h_eee__l_l__ooooo_ _w__oor__lll_dd__. Decoding converts such a sequence of token probability vectors into the text hello world.
Acoustic model training and validation on the corresponding manifest files are implemented in the Train.py script. The script arguments are:
- paths to the training and validation manifest files; if None is passed, training or validation is skipped

parser.add_argument('--trainManifest', metavar='DIR', help='path to train manifest csv', default='Data/train-clean.csv')
parser.add_argument('--valManifests', metavar='DIR', help='path to validation manifest csv', default=['Data/dev-clean.csv'])

- batch size, number of training epochs, learning rate

parser.add_argument('--batchSize', default=8, type=int, help='Batch size for training')
parser.add_argument('--epochs', default=100, type=int, help='Number of training epochs')
parser.add_argument('--lr', '--learningRate', default=1e-5, type=float, help='initial learning rate')

- arguments defining how often the acoustic model weights are saved, the prefix of the weight file names, and the directory the weights are saved to

parser.add_argument('--checkpointPerBatch', default=3000, type=int, help='Save checkpoint per batch. 0 means never save')
parser.add_argument('--checkpointName', default='w2l', type=str, help='Name of checkpoints')
parser.add_argument('--saveFolder', default='Checkpoints/', help='Location to save epoch models')

- path to the acoustic model weights to initialize the model with

parser.add_argument('--continueFrom', default=None, help='Continue from checkpoint model')

Training and validation data loaders, trainLoader and valLoaders, are initialized with the previously described data handling classes via the getDataLoader function:

from Data.dataLoader import DataLoader, SpectrogramDataset, BucketingSampler
...
def getDataLoader(manifestFilePath, config, batchSize):
    dataset = SpectrogramDataset(manifestFilePath=manifestFilePath, config=config)
    sampler = BucketingSampler(dataset, batchSize=batchSize)
    return DataLoader(dataset, batchSampler=sampler)
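With these pieces in place, the loaders mentioned above can be created roughly as follows (a sketch; it assumes args = parser.parse_args() has already been called and config is an instance of LibriConfig):
trainLoader = getDataLoader(args.trainManifest, config, args.batchSize)
valLoaders = [getDataLoader(path, config, args.batchSize) for path in args.valManifests]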
Training the acoustic model¶
The neural network architecture of our acoustic model is inspired by the Wav2Letter approach.
The advantage of this acoustic model is that it consists entirely of convolutional layers, which makes the computations more efficient. The Wav2Letter network is already implemented in PuzzleLib, so you merely need to import it from the library.
from PuzzleLib.Models.Nets.WaveToLetter import loadW2L
nfft = int(config.sampleRate * config.windowSize)
w2l = loadW2L(modelpath=args.continueFrom, inmaps=(1 + nfft // 2), nlabels=len(config.labels))
The length of the acoustic model output is determined by the length of the spectrogram and is generally much larger than the length of the transcript, so the same transcript can correspond to many different token alignments, for example:

______________h_e_l_l_ooo__ _wo__rr__llll__d________
________hhh_eel_lll_oo__ __w_o_r_l_ddd_____
____hh__eeee_lll_l_oooo__ _www_ooo_r_ll_d__

This is why the Connectionist Temporal Classification (CTC) criterion is used for training the neural network. CTC maximizes the total probability of all token sequences that, for a given output length, correspond to a given transcript. The CTC criterion is also implemented in PuzzleLib. To initialize it, you only need to pass the index of the blank (dummy) token.
from PuzzleLib.Cost.CTC import CTC
...
blankIndex = [i for i in range(len(config.labels)) if config.labels[i] == '_'][0]
ctc = CTC(blankIndex)
The Adam optimizer is used to train the network:
from PuzzleLib.Optimizers.Adam import Adam
...
adam = Adam(alpha=args.lr)
adam.setupOn(w2l, useGlobalState=True)
The train function trains the model. The model is switched into training mode, then the data loader iteratively generates batches. Each batch is transferred to the GPU and the model is applied to it.
def train(model, ctc, optimizer, loader, checkpointPerBatch, saveFolder, saveName):
model.reset()
model.trainMode()
loader.reset()
for i, (data) in enumerate(loader):
inputs, inputPercentages, targets, targetSizes, _ = data
gpuInput = gpuarray.to_gpu(inputs.astype(np.float32))
out = model(gpuInput)
outlen = gpuarray.to_gpu((out.shape[0] * inputPercentages).astype(np.int32))
The CTC criterion receives the model predictions out, the prediction lengths outlen, the target token sequences targets, and their lengths targetSizes, and returns the error and the gradient:
error, grad = ctc([out, outlen], [targets, targetSizes])
optimizer.zeroGradParams()
model.backward(grad, updGrad=False)
optimizer.update()
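For orientation, the outer loop of Train.py can be thought of as roughly the following sketch (the exact script may differ; validate and the decoder it uses are described in the next section):
for epoch in range(args.epochs):
    # one pass over the training data with the CTC criterion
    train(w2l, ctc, adam, trainLoader, args.checkpointPerBatch, args.saveFolder, args.checkpointName)

    # measure the recognition quality on each validation set
    for valLoader in valLoaders:
        validate(w2l, valLoader, decoder, logPath='Checkpoints/validation.log')  # logPath is hypothetical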
Decoding and validation¶
The output of the acoustic model is a sequence of token probability vectors for each element of the batch; for decoding it is rearranged into a tensor of shape (N_{batch}, T, N_{tokens}). The task is to turn this sequence of token probability vectors into text. We will look at the simplest approach: the greedy decoder. It is implemented in the Decoder.py file. To initialize it, you need to pass the token alphabet and the index of the blank token.
from Decoder import GreedyDecoder
...
decoder = GreedyDecoder(config.labels, blankIndex)
The decode method takes the token with the maximum probability at each time step. As a result we get, for example, a sequence of token indices corresponding to the token sequence _______hh__eeee_lll_l_oooo__ _www_ooo_r_ll_d__ .
def decode(self, probs, sizes=None):
npMaxProbs = np.argmax(probs, 2)
The processString method processes this sequence of token indices. Traversing the sequence, it appends a new character char to the prediction string if it is not the blank character and does not match the previous character in the sequence. As a result, we get the text hello world.
def processString(self, sequence, size):
string = ''
for i in range(size):
char = self.intToChar[sequence[i].item()]
if char != self.intToChar[self.blankIndex]:
if i != 0 and char == self.intToChar[sequence[i - 1].item()]:
pass
elif char == self.labels[self.spaceIndex]:
string += ' '
else:
string = string + char
return string
The calcWer function calculates the word error rate (WER) between texts s1 and s2. The texts are represented as sequences of words, each word is mapped to a single character, and the Levenshtein distance between the two resulting sequences is computed and normalized by the length of the second sequence.
def calcWer(s1, s2):
b = set(s1.split() + s2.split())
word2char = dict(zip(b, range(len(b))))
w1 = [chr(word2char[w]) for w in s1.split()]
w2 = [chr(word2char[w]) for w in s2.split()]
return Levenshtein.distance(''.join(w1), ''.join(w2)) / len(''.join(w2))
The calcCer function calculates the character error rate (CER). Spaces are removed, the texts are treated as sequences of characters, and the Levenshtein distance between them is computed and normalized by the length of the second text.
def calcCer(s1, s2):
s1, s2, = s1.replace(' ', ''), s2.replace(' ', '')
return Levenshtein.distance(s1, s2) / len(s2)
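For example, if the model predicts hello word while the reference is hello world, one of the two reference words is wrong and one character out of the ten reference characters (spaces removed) is missing:
print(calcWer('hello word', 'hello world'))  # 0.5
print(calcCer('hello word', 'hello world'))  # 0.1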
The validate function evaluates the model. The model is switched to inference mode, and the totalCer and totalWer quality metrics are initialized with zeros.
def validate(model, loader, decoder, logPath):
loader.reset()
model.evalMode()
totalCer, totalWer = 0, 0
for i, (data) in enumerate(loader):
inputs, inputPercentages, targets, targetSizes, inputFile = data
gpuInput = gpuarray.to_gpu(inputs.astype(np.float32))
out = model(gpuInput)
outlen = (out.shape[0] * inputPercentages).astype(np.int32)
decodedOutput = decoder.decode(np.moveaxis(out.get(), 0, 1), outlen)
wer and cer are calculated for each prediction in the batch and added to the accumulated totalCer and totalWer.
wer, cer = 0, 0
for x in range(len(decodedOutput)):
transcript, reference = decodedOutput[x], inputFile[x][1]
print ('transcript: {}\nreference: {}\nfilepath: {}'.format(transcript, reference, inputFile[x][0]))
try:
wer += calcWer(transcript, reference)
cer += calcCer(transcript, reference)
except Exception as e:
print ('encountered exception {}'.format(e))
totalCer += cer
totalWer += wer
Finally, wer and cer are normalized by the size of the validation dataset and converted to percentages.
wer = totalWer / len(loader.dataset) * 100
cer = totalCer / len(loader.dataset) * 100
Results¶
After 42 epochs of training on the clean training data, we achieved the following quality on the clean LibriSpeech validation and test data:
| Dataset | WER, % | CER, % |
|---|---|---|
| dev-clean | 17.526 | 6.611 |
| test-clean | 16.495 | 6.131 |