A new automatic speech recognizer for Brazilian Portuguese based on deep neural networks and transfer learning

PyTorch code for "A new automatic speech recognizer for Brazilian Portuguese based on deep neural networks and transfer learning", submitted to AES-LAC 2018.

TL;DR

The paper demonstrates how to perform transfer learning from a pre-trained Deep Speech 2-based model for English to Brazilian Portuguese, outperforming previous work and achieving a character error rate (CER) of ~16%.

Abstract

This paper addresses the problem of training deep learning models for automatic speech recognition on languages with few available resources, such as Brazilian Portuguese, by employing transfer learning strategies. Starting from a backbone model trained on English, the best fine-tuned network reduces the character error rate by 8.5%, outperforming previous works.
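For context, CER is the edit (Levenshtein) distance between the predicted and reference character sequences divided by the reference length; a minimal illustrative sketch (this helper is not part of the repo):

```python
# Illustrative computation of the character error rate (CER);
# this helper is not part of the repo.

def levenshtein(ref, hyp):
    """Edit distance between two sequences, by dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """CER = edit distance / length of the reference transcript."""
    return levenshtein(ref, hyp) / len(ref)

print(cer("casa amarela", "caza amarela"))  # 1 edit / 12 chars ~= 0.083
```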

Installation

Several libraries need to be installed for training to work. I will assume that everything is installed in an Anaconda installation on Ubuntu.

Install PyTorch if you haven't already.

Clone this repo and run this within the repo:

pip install -r requirements.txt

Install this fork for Warp-CTC bindings:

git clone https://github.com/SeanNaren/warp-ctc.git
cd warp-ctc
mkdir build; cd build
cmake ..
make
export CUDA_HOME="/usr/local/cuda"
cd ../pytorch_binding
python setup.py install

Install pytorch audio:

sudo apt-get install sox libsox-dev libsox-fmt-all
git clone https://github.com/pytorch/audio.git
cd audio
python setup.py install

Install ignite:

git clone https://github.com/pytorch/ignite.git && \
cd ignite && \
python setup.py install && \
cd .. && \
rm -rf ignite

Docker

We also provide a Dockerfile. See here for how to set up Docker correctly.

Usage

Dataset

Five datasets were used in this paper. Librispeech was used to train the backbone model; VoxForge PT-BR, Sid, and Spoltech were used to fine-tune the pre-trained model to Brazilian Portuguese; and LapsBM was used to perform validation and testing.

Librispeech

To download and set up the Librispeech dataset, run the below command in the root folder of the repo:

python -m data.librispeech

Note that this dataset does not come with a validation dataset or test dataset.

Voxforge

To download and set up the VoxForge dataset, run the below command in the root folder of the repo:

python -m data.voxforge

Note that this dataset does not come with a validation dataset or test dataset.

Sid

To download and set up the Sid dataset, run the below command in the root folder of the repo:

python -m data.sid

Note that this dataset does not come with a validation dataset or test dataset.

Spoltech

The Spoltech dataset is not freely available, so you need to purchase and download it here. Then, extract it into data/spoltech_dataset/downloads/extracted/files. Finally, run the below command in the root folder of the repo:

python -m data.spoltech

Note that this dataset does not come with a validation dataset or test dataset.

LapsBM

To download and set up the LapsBM dataset, run the below command in the root folder of the repo:

python -m data.lapsbm

Custom Dataset

To create a custom dataset you must create a CSV file containing the locations of the training data. This has to be in the format of:

/path/to/audio.wav,/path/to/text.txt
/path/to/audio2.wav,/path/to/text2.txt
...

The first path is to the audio file, and the second path is to a text file containing the transcript on one line. This can then be used as stated below.
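For illustration, such a manifest could be generated by a small script like the one below (a hypothetical helper; it assumes each .wav file has a transcript with the same basename):

```python
# Hypothetical helper for building a manifest CSV; the directory layout
# (one .txt transcript next to each .wav) is an assumption.
import csv
from pathlib import Path

def build_manifest(data_dir, output_csv):
    rows = []
    for wav in sorted(Path(data_dir).rglob('*.wav')):
        txt = wav.with_suffix('.txt')      # transcript shares the basename
        if txt.exists():
            rows.append((str(wav), str(txt)))
    with open(output_csv, 'w', newline='') as f:
        csv.writer(f).writerows(rows)

build_manifest('data/my_dataset', 'data/my_dataset.train.csv')
```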

Creating the PT-BR training manifest

The PT-BR training manifest is an ensemble of three smaller datasets: VoxForge, Spoltech and Sid.

Merging multiple manifest files

To create bigger manifest files (to train/test on multiple datasets at once), you can merge manifest files as below, from a directory containing all the manifests you want to merge. You can also prune short and long clips out of the new manifest; a conceptual sketch of that follows the commands.

cd data/
python merge_manifests.py --output-path pt_BR.train.csv sid.train.csv spoltech.train.csv voxforge.train.csv
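Conceptually, merging just concatenates the manifests' rows, and pruning drops rows whose audio falls outside a duration range; a rough sketch of that idea (this is not the actual merge_manifests.py, whose options may differ, and the duration bounds here are arbitrary):

```python
# Conceptual sketch of manifest merging with duration pruning; not the
# repo's merge_manifests.py. The 1-15 s bounds are arbitrary examples.
import csv
import wave

def wav_duration(path):
    with wave.open(path) as w:
        return w.getnframes() / w.getframerate()

def merge(manifests, output_csv, min_dur=1.0, max_dur=15.0):
    with open(output_csv, 'w', newline='') as out:
        writer = csv.writer(out)
        for manifest in manifests:
            with open(manifest, newline='') as f:
                for row in csv.reader(f):
                    # keep well-formed rows whose audio is within bounds
                    if len(row) == 2 and min_dur <= wav_duration(row[0]) <= max_dur:
                        writer.writerow(row)

merge(['sid.train.csv', 'spoltech.train.csv', 'voxforge.train.csv'],
      'pt_BR.train.csv')
```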

Training

The script train.py allows training the Deep Speech 2 model with a variety of hyperparameters and arbitrary datasets, using a .json configuration file as input. You can check several examples in the scripts folder.

Options such as checkpointing and the location of the results folder are passed through the command line. You can run

python train.py --help

or just check out train.py for more details.

Checkpoints

Training supports saving checkpoints of the model to continue training. To enable epoch checkpoints use:

python train.py --checkpoint

To continue from a checkpointed model that has been saved:

python train.py --continue-from path/to/model.pth.tar

Also note that there is no final softmax layer on the model, since warp-ctc applies the softmax internally during training. Any decoder built on top of the model therefore has to apply the softmax to the raw output itself, so take this into consideration!
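For instance, a decoder working on the model's raw output would apply the softmax first; a minimal sketch of a greedy CTC decoder doing so (the label set and output shape are assumptions):

```python
# Minimal sketch of decoding the model's raw (pre-softmax) output; the
# label set and the (time, n_labels) shape are assumptions.
import torch
import torch.nn.functional as F

LABELS = "_abcdefghijklmnopqrstuvwxyz '"   # assume index 0 is the CTC blank

def greedy_decode(raw_output):
    """raw_output: (time, n_labels) tensor, no softmax applied yet."""
    # For pure argmax the softmax would not change the result, but
    # beam-search/LM decoders need actual probabilities, so apply it here.
    probs = F.softmax(raw_output, dim=-1)
    best = probs.argmax(dim=-1).tolist()
    decoded, prev = [], None
    for idx in best:
        if idx != prev and idx != 0:       # collapse repeats, drop blanks
            decoded.append(LABELS[idx])
        prev = idx
    return ''.join(decoded)

print(greedy_decode(torch.randn(50, len(LABELS))))
```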

Backbone model

To train the backbone model, run

python train.py scripts/librispeech-from_scratch.json --data-dir data/ --train-manifest data/librispeech.train.csv --val-manifest data/librispeech.val.csv --local --checkpoint

and a folder called results/librispeech-from_scratch will be created containing the model checkpoints and the best model file. In the paper, our best backbone model achieved word error rates (WER) of 11.66% and 30.70% on the test-clean and test-other sets, respectively.

Fine-tuning with the same label set

The commands listed below reproduce the experiments conducted in Sec. 5.2 of the paper.

From scratch

python train.py scripts/pt_BR-from_scratch.json --data-dir data/ --train-manifest data/pt_BR.train.csv --val-manifest data/lapsbm.val.csv --local --checkpoint

Freeze

python train.py scripts/pt_BR-finetune-freeze.json --data-dir data/ --train-manifest data/pt_BR.train.csv --val-manifest data/lapsbm.val.csv --continue-from results/librispeech-from_scratch/models/model_best-ckpt_5.pth --local --checkpoint
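In PyTorch, the freezing strategy amounts to disabling gradients for the backbone and training only the output layer; an illustrative sketch with a toy stand-in model (the real architecture and attribute names differ):

```python
# Illustrative sketch of the freezing strategy; ToyModel stands in for
# the real network, whose architecture and attribute names differ.
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(161, 128, batch_first=True)  # stand-in backbone
        self.fc = nn.Linear(128, 29)                   # output layer, 29 labels
    def forward(self, x):
        x, _ = self.rnn(x)
        return self.fc(x)

model = ToyModel()
for p in model.parameters():       # freeze everything...
    p.requires_grad = False
for p in model.fc.parameters():    # ...except the output layer
    p.requires_grad = True

# optimize only the parameters that remain trainable
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=3e-4)
```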

Fine-tune

with lr=3e-4

python train.py scripts/pt_BR-finetune.json --data-dir data/ --train-manifest data/pt_BR.train.csv --val-manifest data/lapsbm.val.csv --continue-from results/librispeech-from_scratch/models/model_best-ckpt_5.pth --local --checkpoint

with lr=3e-5

python train.py scripts/pt_BR-finetune-lower-lr.json --data-dir data/ --train-manifest data/pt_BR.train.csv --val-manifest data/lapsbm.val.csv --continue-from results/librispeech-from_scratch/models/model_best-ckpt_5.pth --local --checkpoint

Results

The results of these models on the test set are listed below:

|     | [12]   | scratch | freeze | fine-tuning |
|-----|--------|---------|--------|-------------|
| CER | 25.13% | 22.19%  | 30.80% | 16.17%      |

Fine-tuning with a broader label set

The commands listed below reproduce the experiments conducted in Sec. 5.3 of the paper.

From scratch

python train.py scripts/pt_BR-from_scratch-accents.json --data-dir data/ --train-manifest data/pt_BR.train.csv --val-manifest data/lapsbm.val.csv --local --checkpoint

Random FC weights

python train.py scripts/pt_BR-finetune-accents-random-fc.json --data-dir data/ --train-manifest data/pt_BR.train.csv --val-manifest data/lapsbm.val.csv --continue-from results/librispeech-from_scratch/models/model_best-ckpt_5.pth --local --checkpoint

Non-random FC weights

python train.py scripts/pt_BR-finetune-accents-map-fc.json --data-dir data/ --train-manifest data/pt_BR.train.csv --val-manifest data/lapsbm.val.csv --continue-from results/librispeech-from_scratch/models/model_best-ckpt_5.pth --local --checkpoint
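The non-random variant copies the rows of the pre-trained output layer for labels shared between the two alphabets, leaving only the new (accented) labels randomly initialized; a hedged sketch of that mapping (the alphabets, layer sizes, and names here are assumptions):

```python
# Sketch of mapping shared-label weights from the English output layer
# into a broader PT-BR one; alphabets and layer shapes are assumptions.
import torch
import torch.nn as nn

EN_LABELS = list("_abcdefghijklmnopqrstuvwxyz '")
PT_LABELS = EN_LABELS + list("áâãàçéêíóôõúü")    # broader label set

hidden = 128
old_fc = nn.Linear(hidden, len(EN_LABELS))       # pre-trained output layer
new_fc = nn.Linear(hidden, len(PT_LABELS))       # randomly initialized

with torch.no_grad():
    for i, label in enumerate(PT_LABELS):
        if label in EN_LABELS:                   # reuse the English row
            j = EN_LABELS.index(label)
            new_fc.weight[i] = old_fc.weight[j]
            new_fc.bias[i] = old_fc.bias[j]
# rows for the accented labels keep their random initialization
```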

Results

|     | scratch | random FC weights | non-random FC weights |
|-----|---------|-------------------|-----------------------|
| CER | 22.78%  | 17.73%            | 17.72%                |

Testing/Inference

To evaluate a trained model on a test set (which has to be in the same format as the training set):

python test.py --model-path models/deepspeech.pth --manifest /path/to/test_manifest.csv --cuda

Pre-trained models

Pre-trained models can be found under releases here.

Citation

If you use this code in your research, please use the following BibTeX entry:

@inproceedings{quintanilha2018,
    author = {Quintanilha, I. M. and Biscainho, L. W. P. and Netto, S. L.},
    title = "A new automatic speech recognizer for Brazilian Portuguese based on deep neural networks and transfer learning",
    booktitle = "Congreso Latinoamericano de Ingenier\'{i}a de Audio",
    address = {Montevideo, Uruguay},
    month = {September},
    year = {2018},
    note = {(Submitted)}
}

Acknowledgements

This research was partially supported by CNPq and CAPES.

Thanks to SeanNaren, whose implementation inspired ours.

References

[12] I. M. Quintanilha, "End-to-end speech recognition applied to Brazilian Portuguese using deep learning," M.Sc. dissertation, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil, 2017.
