Speech recognition with Wav2Letter¶
In this tutorial, we will learn how to use the PuzzleLib library to build a speech recognition system. We will train the Wav2Letter neural network, already implemented in PuzzleLib, on the open LibriSpeech dataset.
Data preparation¶
In order to start model training, you need to prepare special csv files (manifests) for training, validation, and test data. A line in the manifest file contains the path to the audio file and its transcript.
The script PrepareLibrispeech.py downloads the LibriSpeech dataset, converts the audio files from flac to wav, and creates 6 manifest files: two each (clean and noisier data) for training, validation, and testing. The lines of the manifest files are sorted by the length of the corresponding audio recordings, which helps to train the model more efficiently. The script arguments are the full path for saving the data (dataDir) and the path for saving the manifest files (manifestDir).
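For reference, a manifest is just a two-column csv file without a header: the first column is the path to a wav file, the second is its transcript. Below is a purely illustrative sketch of writing such a file by hand; the paths and transcripts are made up, and in practice the file is produced by PrepareLibrispeech.py.
# Purely illustrative sketch of a manifest file; paths and transcripts are made up.
rows = [
    ("/data/LibriSpeech/wav/1272-128104-0000.wav", "mister quilter is the apostle of the middle classes"),
    ("/data/LibriSpeech/wav/1272-128104-0001.wav", "nor is his manner less interesting than his matter")
]

with open("Data/dev-clean.csv", "w") as f:
    for audioPath, transcript in rows:
        f.write("{},{}\n".format(audioPath, transcript))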
Extracting features¶
There are different ways to extract features from audio data. We use normalized spectrograms, since they describe the data in sufficient detail. The hyperparameters, as well as the token dictionary, are set in the LibriConfig class of the Config.py file. The labels token dictionary consists of the Latin alphabet, an apostrophe, a space, and a dummy (blank) character that represents the absence of a letter.
class LibriConfig:
    sampleRate = 16000
    windowSize = 0.02
    windowStride = 0.01
    window = 'hamming'
    labels = "_'abcdefghijklmnopqrstuvwxyz "
The transformations applied to the audio are implemented in the preprocess function of the Data/dataLoader.py script. The audio is read at the specified sampling rate (sampleRate), and a short-time Fourier transform is then applied to the signal with the specified window size (windowSize), step (windowStride), and window function type (window). The magnitude of the resulting complex-valued spectrogram is extracted. The described methods are implemented in the librosa library.
def preprocess(audioPath, sampleRate, windowSize, windowStride, window):
    y = loadAudio(audioPath)

    nFft = int(sampleRate * windowSize)
    winLength = nFft
    hopLength = int(sampleRate * windowStride)

    D = librosa.stft(y, n_fft=nFft, hop_length=hopLength, win_length=winLength, window=window)
    spect, phase = librosa.magphase(D)
The pcen (per-channel energy normalization) function is then applied to the spectrogram. This is a popular audio normalization technique in speech recognition: it reduces background noise and emphasizes the foreground sound.
    pcenResult = pcen(E=spect, sr=sampleRate, hopLength=hopLength)
    return pcenResult
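To sanity-check the feature extraction, preprocess can be called directly on a single wav file. This is a hedged sketch: the wav path is hypothetical, and it assumes the example's Config.py and Data/dataLoader.py are importable from the working directory.
# Hedged sketch: run preprocess on one file; the wav path below is hypothetical.
from Config import LibriConfig
from Data.dataLoader import preprocess

config = LibriConfig

spect = preprocess("/data/LibriSpeech/wav/1272-128104-0000.wav", config.sampleRate, config.windowSize, config.windowStride, config.window)

# with sampleRate=16000 and windowSize=0.02, nFft=320, so the spectrogram has 1 + 320 // 2 = 161 rows
print(spect.shape)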
The SpectrogramDataset class is initialized with a manifest file and the config. It describes an audio dataset with transcripts: it retrieves the audio file paths and transcriptions from the manifest file, creates the labelsMap token dictionary, and sets the audio processing hyperparameters.
class SpectrogramDataset(object):
    def __init__(self, manifestFilePath, config):
        self.manifestFilePath = manifestFilePath
        with open(self.manifestFilePath) as f:
            ids = f.readlines()
        self.ids = [x.strip().split(',') for x in ids]
        self.size = len(ids)
        self.labelsMap = dict([(config.labels[i], i) for i in range(len(config.labels))])
        self.sampleRate = config.sampleRate
        self.windowSize = config.windowSize
        self.windowStride = config.windowStride
        self.window = config.window
The __getitem__ method returns the spectrogram computed by the preprocess function, the transcript translated into a sequence of token indexes, the path to the audio file, and the original text transcript.
    def __getitem__(self, index):
        sample = self.ids[index]
        audioPath, transcriptLoaded = sample[0], sample[1]

        spect = preprocess(audioPath, self.sampleRate, self.windowSize, self.windowStride, self.window)
        transcript = list(filter(None, [self.labelsMap.get(x) for x in list(transcriptLoaded)]))

        return spect, transcript, audioPath, transcriptLoaded
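A short usage sketch of the dataset class, assuming the dev-clean manifest from the data preparation step already exists:
# Sketch: read the first sample of the dataset (assumes Data/dev-clean.csv exists).
from Config import LibriConfig
from Data.dataLoader import SpectrogramDataset

dataset = SpectrogramDataset(manifestFilePath="Data/dev-clean.csv", config=LibriConfig)
spect, transcript, audioPath, text = dataset[0]

print(spect.shape)      # (frequency bins, time frames)
print(transcript[:10])  # first token indexes from labelsMap
print(text)             # original transcript string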
The BucketingSampler class aggregates the data into batches. If necessary, it shuffles the data indexes and then groups them into bins, where the size of a bin corresponds to the batch size.
class BucketingSampler(object):
    def __init__(self, dataSource, batchSize=1, shuffle=False):
        self.dataSource = dataSource
        self.batchSize = batchSize
        self.ids = list(range(0, len(dataSource)))
        self.shuffle = shuffle
        self.reset()

    ...

    def getBins(self):
        if self.shuffle:
            np.random.shuffle(self.ids)
        self.bins = [self.ids[i:i + self.batchSize] for i in range(0, len(self.ids), self.batchSize)]

    def reset(self):
        self.getBins()
        self.batchId = 0
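The bins are simply lists of index lists; a toy illustration, where a 10-element list stands in for a dataset:
# Toy illustration: the 10-element list stands in for a dataset.
from Data.dataLoader import BucketingSampler

sampler = BucketingSampler(list(range(10)), batchSize=4, shuffle=False)
print(sampler.bins)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]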
The DataLoader class uses these index groups to generate batches iteratively.
class DataLoader(object):
    def __init__(self, dataset, batchSampler):
        self.dataset = dataset
        self.batchSampler = batchSampler
        self.sampleIter = iter(self.batchSampler)

    def __next__(self):
        try:
            indices = next(self.sampleIter)
            indices = [i for i in indices][0]
            batch = getBatch([self.dataset[i] for i in indices])
            return batch
        except:
            raise StopIteration()
The getBatch function assembles the batches. The features extracted from the audio signals are padded to the maximum length in the batch via 'wrap' padding and aggregated into a three-dimensional inputs tensor. This is why the audio in the manifest files was sorted by length: grouping signals of similar length into a single batch keeps the padding small. The relative lengths of the signals are recorded in inputPercentages.
seqLength = tensor.shape[1]
tensorNew = np.pad(tensor, ((0, 0), (0, abs(seqLength-maxSeqlength))), 'wrap')
inputs[x] = tensorNew
inputPercentages[x] = seqLength / float(maxSeqlength)
The transcripts, translated into sequences of token indexes, are concatenated into targets, and the lengths of the original sequences are stored in targetSizes. Together with the paths to the audio files and the transcripts, these variables form a batch.
targetSizes[x] = len(target)
targets.extend(target)
inputFilePathAndTranscription.append([tensorPath, orignalTranscription])
targets = np.array(targets)
return inputs, inputPercentages, targets, targetSizes, inputFilePathAndTranscription
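Putting these classes together, one batch can be inspected as follows. This is a hedged sketch that simply chains the classes defined above and assumes the dev-clean manifest exists:
# Sketch: build a loader and inspect the first batch (assumes Data/dev-clean.csv exists).
from Config import LibriConfig
from Data.dataLoader import DataLoader, SpectrogramDataset, BucketingSampler

dataset = SpectrogramDataset(manifestFilePath="Data/dev-clean.csv", config=LibriConfig)
sampler = BucketingSampler(dataset, batchSize=8, shuffle=False)
loader = DataLoader(dataset, batchSampler=sampler)

inputs, inputPercentages, targets, targetSizes, meta = next(loader)

print(inputs.shape)          # (batch, frequency bins, max time frames)
print(inputPercentages[:3])  # relative lengths of the first signals in the batch
print(targetSizes[:3])       # transcript lengths in tokens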
Speech recognition system¶
The speech recognition system consists of two modules: an acoustic model and a decoder. The acoustic model establishes a relationship between the audio signal and the probability distribution of tokens in it. Here it is a convolutional neural network that receives a sequence of features extracted from an audio signal and outputs a sequence of token probability vectors; in other words, the model predicts the probabilities of tokens for each fixed time window. Taking the token with the maximum probability at each step of the acoustic model output yields something like ______________h_eee__l_l__ooooo_ _w__oor__lll_dd__. Decoding converts this sequence of token probability vectors into the text hello world.
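The collapse rule implied here is simple: merge consecutive repeated tokens, then drop the blank token. A tiny standalone illustration of that rule, independent of the library code below:
# Standalone illustration of greedy collapsing: merge repeats, then drop the blank '_'.
def collapse(sequence, blank='_'):
    decoded, previous = [], None

    for char in sequence:
        if char != previous and char != blank:
            decoded.append(char)
        previous = char

    return ''.join(decoded)

print(collapse("______________h_eee__l_l__ooooo_ _w__oor__lll_dd__"))  # hello world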
Acoustic model training and validation on the corresponding manifest files are implemented in the Train.py script. The script arguments are:
- paths to the training and validation manifest files; if None is passed, training or validation is skipped
parser.add_argument('--trainManifest', metavar='DIR', help='path to train manifest csv', default='Data/train-clean.csv')
parser.add_argument('--valManifests', metavar='DIR', help='path to validation manifest csv', default=['Data/dev-clean.csv'])
- batch size, number of training epochs, and learning rate
parser.add_argument('--batchSize', default=8, type=int, help='Batch size for training')
parser.add_argument('--epochs', default=100, type=int, help='Number of training epochs')
parser.add_argument('--lr', '--learningRate', default=1e-5, type=float, help='initial learning rate')
- arguments that define how often the acoustic model weights are saved, the prefix of the weight file names, and the directory where the weights are saved
parser.add_argument('--checkpointPerBatch', default=3000, type=int, help='Save checkpoint per batch. 0 means never save')
parser.add_argument('--checkpointName', default='w2l', type=str, help='Name of checkpoints')
parser.add_argument('--saveFolder', default='Checkpoints/', help='Location to save epoch models')
- path to the acoustic model weights to initialize it with
parser.add_argument('--continueFrom', default=None, help='Continue from checkpoint model')
Training and validation data loaders are initialized with the previously described data handling classes and the getDataLoader function: trainLoader and valLoaders.
from Data.dataLoader import DataLoader, SpectrogramDataset, BucketingSampler
...

def getDataLoader(manifestFilePath, config, batchSize):
    dataset = SpectrogramDataset(manifestFilePath=manifestFilePath, config=config)
    sampler = BucketingSampler(dataset, batchSize=batchSize)
    return DataLoader(dataset, batchSampler=sampler)
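A hedged sketch of how these loaders can be built from the parsed arguments; the exact wiring in Train.py may differ slightly:
# Sketch of building the loaders from the parsed arguments; the real Train.py may differ in detail.
from Config import LibriConfig

args = parser.parse_args()
config = LibriConfig

trainLoader = None if args.trainManifest is None else getDataLoader(args.trainManifest, config, args.batchSize)
valLoaders = [] if args.valManifests is None else [getDataLoader(path, config, args.batchSize) for path in args.valManifests]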
Training the acoustic model¶
The acoustic model neural network architecture we use is inspired by the Wav2Letter technology.
The advantage of this acoustic model is that it consists entirely of convolutional layers, which leads to more efficient computations. The Wav2Letter network is already implemented in PuzzleLib, so you merely need to import it from the library.
from PuzzleLib.Models.Nets.WaveToLetter import loadW2L
nfft = int(config.sampleRate * config.windowSize)
w2l = loadW2L(modelpath=args.continueFrom, inmaps=(1 + nfft // 2), nlabels=len(config.labels))
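The inmaps value is simply the number of spectrogram rows produced by preprocess, and nlabels is the size of the token dictionary; this is easy to verify:
# Quick check of the network input/output dimensions for the config above.
nfft = int(16000 * 0.02)                     # 320 samples per window
print(1 + nfft // 2)                         # 161 frequency bins, i.e. inmaps
print(len("_'abcdefghijklmnopqrstuvwxyz "))  # 29 tokens, i.e. nlabels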
Note that many different outputs of the acoustic model correspond to the same transcript hello world, for example:
______________h_e_l_l_ooo__ _wo__rr__llll__d__
______________hhh_eel_lll_oo__ __w_o_r_l_ddd__
_______hh__eeee_lll_l_oooo__ _www_ooo_r_ll_d__
This is why the Connectionist Temporal Classification (CTC) criterion is used for training the neural network. CTC maximizes the total probability of all token sequences that, for a given sequence length, correspond to a given transcript. The CTC criterion is also implemented in PuzzleLib; to initialize it, you only need to pass the index of the dummy (blank) token.
from PuzzleLib.Cost.CTC import CTC
...
blankIndex = [i for i in range(len(config.labels)) if config.labels[i] == '_'][0]
ctc = CTC(blankIndex)
The Adam optimizer is used to train the network:
from PuzzleLib.Optimizers.Adam import Adam
...
adam = Adam(alpha=args.lr)
adam.setupOn(w2l, useGlobalState=True)
The train function trains the model. The model is switched to training mode, then the data loader iteratively generates batches. Each batch is transferred to the GPU and the model is applied to it.
def train(model, ctc, optimizer, loader, checkpointPerBatch, saveFolder, saveName):
    model.reset()
    model.trainMode()
    loader.reset()

    for i, (data) in enumerate(loader):
        inputs, inputPercentages, targets, targetSizes, _ = data
        gpuInput = gpuarray.to_gpu(inputs.astype(np.float32))

        out = model(gpuInput)
        outlen = gpuarray.to_gpu((out.shape[0] * inputPercentages).astype(np.int32))
The CTC criterion receives the model predictions out, the prediction lengths outlen, the target token sequences targets, and their lengths targetSizes. It returns the error value and the gradient with respect to the network output, which is then propagated backwards to update the weights.
error, grad = ctc([out, outlen], [targets, targetSizes])
optimizer.zeroGradParams()
model.backward(grad, updGrad=False)
optimizer.update()
Decoding and validation¶
The output of the acoustic model is a tensor of dimension (N_{batch}, T, N_{tokens}). The task is to get text from this sequence of token probability vectors. We will look at the simplest approach: the greedy decoder. It is implemented in the Decoder.py file; to initialize it, you need to pass the token dictionary and the index of the dummy (blank) symbol.
from Decoder import GreedyDecoder
...
decoder = GreedyDecoder(config.labels, blankIndex)
Decoding is performed by the decode method. At each timestep the decoder takes the token with the maximum probability. As a result, we get a sequence of token indexes corresponding to, for example, the token sequence _______hh__eeee_lll_l_oooo__ _www_ooo_r_ll_d__.
def decode(self, probs, sizes=None):
    npMaxProbs = np.argmax(probs, 2)
The processString method processes this sequence of token indexes. While traversing the sequence, a new character char is appended to the string prediction if it is not the dummy character and does not match the previous character in the sequence. As a result, we get the text hello world.
def processString(self, sequence, size):
    string = ''

    for i in range(size):
        char = self.intToChar[sequence[i].item()]

        if char != self.intToChar[self.blankIndex]:
            if i != 0 and char == self.intToChar[sequence[i - 1].item()]:
                pass
            elif char == self.labels[self.spaceIndex]:
                string += ' '
            else:
                string = string + char

    return string
The calcWer function calculates the WER (word error rate) between texts s1 and s2. The texts are represented as word sequences; the Levenshtein distance between the two sequences is then calculated and normalized by the length of the second sequence.
def calcWer(s1, s2):
    b = set(s1.split() + s2.split())
    word2char = dict(zip(b, range(len(b))))

    w1 = [chr(word2char[w]) for w in s1.split()]
    w2 = [chr(word2char[w]) for w in s2.split()]

    return Levenshtein.distance(''.join(w1), ''.join(w2)) / len(''.join(w2))
The calcCer function calculates the CER (character error rate). The texts are represented as character sequences; the Levenshtein distance between them is calculated and normalized by the length of the second sequence.
def calcCer(s1, s2):
    s1, s2 = s1.replace(' ', ''), s2.replace(' ', '')
    return Levenshtein.distance(s1, s2) / len(s2)
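Both metrics are easy to check by hand. For instance, for a prediction in which one of two words is wrong, using the two functions above and the Levenshtein package they rely on (prediction first, reference second, as in the validation code below):
# Hand-checked example: prediction first, reference second.
print(calcWer('hello word', 'hello world'))  # 1 wrong word out of 2 -> 0.5
print(calcCer('hello word', 'hello world'))  # 1 wrong character out of 10 (spaces removed) -> 0.1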
Validation is performed by the validate function. The model is switched to evaluation mode, and the totalCer and totalWer quality metrics are initialized with zeros.
def validate(model, loader, decoder, logPath):
    loader.reset()
    model.evalMode()

    totalCer, totalWer = 0, 0

    for i, (data) in enumerate(loader):
        inputs, inputPercentages, targets, targetSizes, inputFile = data
        gpuInput = gpuarray.to_gpu(inputs.astype(np.float32))

        out = model(gpuInput)
        outlen = (out.shape[0] * inputPercentages).astype(np.int32)

        decodedOutput = decoder.decode(np.moveaxis(out.get(), 0, 1), outlen)
The wer and cer metrics are calculated for each prediction in the batch and added to the running totalWer and totalCer totals.
wer, cer = 0, 0

for x in range(len(decodedOutput)):
    transcript, reference = decodedOutput[x], inputFile[x][1]
    print('transcript: {}\nreference: {}\nfilepath: {}'.format(transcript, reference, inputFile[x][0]))

    try:
        wer += calcWer(transcript, reference)
        cer += calcCer(transcript, reference)
    except Exception as e:
        print('encountered exception {}'.format(e))

totalCer += cer
totalWer += wer
Finally, wer and cer are normalized by the size of the validation dataset and expressed as percentages.
wer = totalWer / len(loader.dataset) * 100
cer = totalCer / len(loader.dataset) * 100
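With all of the pieces above, Train.py essentially alternates the two functions over the epochs. A simplified sketch is given below; checkpoint saving and log handling, which the real script takes care of, are omitted, and the logPath argument is passed as None here purely for illustration.
# Simplified sketch of the epoch loop; the real Train.py also handles checkpoints and logs.
for epoch in range(args.epochs):
    if trainLoader is not None:
        train(w2l, ctc, adam, trainLoader, args.checkpointPerBatch, args.saveFolder, args.checkpointName)

    for valLoader in valLoaders:
        validate(w2l, valLoader, decoder, logPath=None)  # logPath handling is simplified here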
Results¶
After 42 epochs of training on the clean training data, we achieved the following quality on the clean LibriSpeech validation and test data:
| Dataset    | WER    | CER   |
|------------|--------|-------|
| dev-clean  | 17.526 | 6.611 |
| test-clean | 16.495 | 6.131 |