This is my PyTorch implementation of Listen, Attend and Spell (LAS), Google's deep learning model for automatic speech recognition (ASR). I used both the Mozilla Common Voice dataset and the LibriSpeech dataset.
The feature transformation is done on the fly while loading the files thanks to torchaudio.
These are the LER (letter error rate) and loss metrics for 4 epochs of training with a considerably smaller architecture, since my GPU didn't have enough memory. The Listener had 128 neurons and 2 layers, while the Speller had 256 neurons, also with 2 layers.
We can see that the model is able to learn from the data we feed it, but it still needs more training and a proper architecture.
| Letter error rate | Loss |
| --- | --- |
| ![]() | ![]() |
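To give a sense of the reduced Listener size mentioned above (2 layers, 128 units), here is a minimal sketch of a pyramidal bidirectional LSTM encoder in the style of the LAS paper. The class names, the 40-dimensional input and the way odd-length inputs are trimmed are assumptions for illustration only, not necessarily how the Listener in this repository is written.

```python
import torch
from torch import nn


class PyramidalBiLSTM(nn.Module):
    """One pyramidal layer: merge pairs of frames, then run a bidirectional LSTM."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # Two consecutive frames are concatenated, so the LSTM sees 2 * input_dim.
        self.lstm = nn.LSTM(input_dim * 2, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):
        batch, frames, features = x.shape
        if frames % 2 == 1:  # drop the last frame so the length is even
            x = x[:, :-1, :]
            frames -= 1
        x = x.reshape(batch, frames // 2, features * 2)
        output, _ = self.lstm(x)
        return output  # (batch, frames // 2, 2 * hidden_dim)


class Listener(nn.Module):
    """Stack of pyramidal layers; each layer halves the time resolution."""

    def __init__(self, input_dim=40, hidden_dim=128, num_layers=2):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers.append(PyramidalBiLSTM(input_dim, hidden_dim))
            input_dim = hidden_dim * 2  # bidirectional output feeds the next layer
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)
```

With two layers the time axis is reduced by a factor of four, which shortens the sequence the Speller has to attend over.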
If we try to predict a sample of audio, the results currently look like this:
Target: ['A', 'N', 'D', '', 'S', 'T', 'I', 'L', 'L', '', 'N', 'O', '', 'A',
'T', 'T', 'E', 'M', 'P', 'T', '', 'B', 'Y', '', 'T', 'H', 'E', '',
'P', 'O']
Prediction: ['A', 'N', 'D', '', 'T', 'H', 'E', 'L', 'L', '', 'T', 'O', 'T', 'M',
'', 'T', 'E', 'N', 'P', 'T', '', 'O', 'E', '', 'T', 'H', 'E', '',
'S', 'R']
Only short, common words such as AND and THE are being properly identified, which leads us to think the model needs longer training to be able to learn more specific words.
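For readers unfamiliar with the metric, a letter error rate can be computed as the character-level edit distance between the reference and the prediction, divided by the reference length. The sketch below is one common way to do that; the function names are made up for this example, and the LER reported in this repository may be computed or normalized differently.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences of symbols."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_symbol in enumerate(reference, start=1):
        current = [i]
        for j, hyp_symbol in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_symbol != hyp_symbol)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def letter_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # The sample above, assuming the first sequence is the reference transcript.
    target = list("AND STILL NO ATTEMPT BY THE PO")
    predicted = list("AND THELL TOTM TENPT OE THE SR")
    print(f"LER: {letter_error_rate(target, predicted):.2f}")
```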
#Will train more and update the results here; still looking for cloud compute credits
The code is set up to run with both the Mozilla Common Voice dataset and the LibriSpeech dataset. If you want to run the code, you should download the datasets and extract them under data/, or run the script utils/
which will download and extract them in the following format:
├── LibriSpeech
│   ├── dev-clean/
│   ├── test-clean/
│   └── train-clean-100/
└── mozilla
    ├── dev.tsv
    ├── invalidated.tsv
    ├── mp3/
    ├── other.tsv
    ├── test.tsv
    ├── train.tsv
    └── validated.tsv
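If you extract the datasets by hand, a quick check like the one below can confirm that the layout matches the tree above before you start processing. This helper is not part of the repository; the function name `missing_entries` and the idea of passing the data/ folder as an argument are assumptions for this sketch.

```python
from pathlib import Path

# Expected entries, copied from the directory tree above.
EXPECTED_LAYOUT = {
    "LibriSpeech": ["dev-clean", "test-clean", "train-clean-100"],
    "mozilla": [
        "dev.tsv", "invalidated.tsv", "mp3", "other.tsv",
        "test.tsv", "train.tsv", "validated.tsv",
    ],
}


def missing_entries(data_root):
    """Return the expected paths (relative to data_root) that do not exist."""
    root = Path(data_root)
    missing = []
    for dataset, entries in EXPECTED_LAYOUT.items():
        for entry in entries:
            path = root / dataset / entry
            if not path.exists():
                missing.append(path.relative_to(root).as_posix())
    return missing


if __name__ == "__main__":
    for entry in missing_entries("data"):
        print(f"missing: {entry}")
```

An empty result means every file and folder from the tree was found. If you only downloaded one of the two datasets, the entries of the other one will be listed as missing.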
So run:
#Remove a flag if you want to skip downloading that specific dataset
$ python utils/ --libri --common
And run the following commands to process and collect all files.
#Still in utils/
$ python utils/ --root $ABSOLUTEPATH TO DATASET
$ python utils/ --root $ABSOLUTEPATH TO DATASET
This will create a processed/
folder inside each of the datasets, containing the CSVs with the data necessary for training, along with vocabulary and word count files.
Execute the train script along with the yaml config file for the desired dataset.
$ python --config_path config/librispeech-config.yaml
# Or
$ python --config_path config/common_voice-config.yaml
Loss and LER will be logged to the runs/
folder; you can check them by running TensorBoard (tensorboard --logdir runs) in the root directory.