Releases: sfarhat/Convolutional-Forced-Alignment
Trained TIMIT Model + Forced Alignment + GRAD-CAM
A deep CNN inspired by Zhang et al. (2016), trained on the TIMIT dataset with a loss other than CTC. Built around the idea of an "ideal alignment" and a slight modification to the data preprocessing, the model classifies phonemes with a Phoneme Error Rate of around 22% on the TIMIT test set, and achieves a 67 ms Average Alignment Error on both the train and test sets.
The hyperparameters are:
Adam learning rate: 10e-5
Batch size: 3
Epochs: 15
Activation: PReLU
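As a sketch, the hyperparameters above could be wired up in PyTorch roughly as follows. The model here is a placeholder stand-in, not the repository's actual architecture, and the variable names are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Placeholder network; the actual architecture follows Zhang et al. (2016).
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.PReLU(),  # PReLU activation, per the release notes
    nn.Flatten(),
)

# Adam optimizer with the stated learning rate of 10e-5 (i.e. 1e-4)
optimizer = torch.optim.Adam(model.parameters(), lr=10e-5)

EPOCHS = 15      # training epochs, per the release notes
BATCH_SIZE = 3   # batch size, per the release notes
```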
In addition, forced alignment can be run on any provided input file, and class activation maps (Grad-CAM) can be generated for desired phonemes/words.
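Grad-CAM's core computation, once a conv layer's activations and their gradients with respect to the target class score are in hand, is a small weighted sum. A NumPy sketch on synthetic arrays (illustrative only, not the repository's code):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap for one target class.

    activations: (K, H, W) feature maps from the chosen conv layer.
    gradients:   (K, H, W) gradients of the target score w.r.t. those maps.
    """
    # Importance weight per channel: global-average-pool the gradients.
    weights = gradients.mean(axis=(1, 2))  # shape (K,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence only.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    # Normalize to [0, 1] for visualization (guard against an all-zero map).
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Synthetic example: 4 channels over an 8-frame x 6-bin spectrogram patch.
rng = np.random.default_rng(0)
heatmap = grad_cam(rng.standard_normal((4, 8, 6)),
                   rng.standard_normal((4, 8, 6)))
```

The resulting heatmap has the spatial shape of the feature maps and highlights the time-frequency regions that most increased the target phoneme's score.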
Trained Librispeech Model
A network similar to that described in the paper by Zhang et al. (2017) is fully implemented and trained. The details of this trained model that differ from the one above are as follows:
- PReLU activation
- batch size of 3
- Adam optimizer with LR = 10e-5
- trained for 50 epochs on the train-clean-100 dataset of LibriSpeech
The trained model weights are attached. The model achieves a 29.97% Character Error Rate based on greedy decoding.
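For context, "greedy decoding" here means taking the argmax character at each frame, collapsing repeats, and dropping blanks (CTC-style); the Character Error Rate is then the edit distance to the reference transcript, normalized by its length. A pure-Python sketch of both steps (illustrative, not the repository's implementation):

```python
def greedy_decode(frame_probs, alphabet, blank=0):
    """Collapse per-frame argmax predictions CTC-style into a string."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:  # drop repeats and blank symbols
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)

def cer(ref, hyp):
    """Character Error Rate: Levenshtein distance / len(ref), single-row DP."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / len(ref)

# Frames: "a", "a" (repeat), blank, "b"  ->  decodes to "ab"
probs = [[0.1, 0.9, 0.0], [0.1, 0.9, 0.0], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
decoded = greedy_decode(probs, alphabet=["-", "a", "b"])  # -> "ab"
```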