Multimodal speech recognition using lipreading (with CNNs) and audio (using LSTMs). Sensor fusion is done with an attention network.

This repository contains most of the code for my thesis 'Design, Implementation and Analysis of a Deep Convolutional-Recurrent Neural Network for Speech Recognition through Audiovisual Sensor Fusion', written at the ESAT (Electrical Engineering) Department of KU Leuven (2016-2017).

Author: Matthijs Van keirsbilck
Supervisor: Bert Moons
Promotor: Marian Verhelst

The code and thesis text are bound by KU Leuven's Student Thesis Copyright Regulations.


The CNN-LSTM networks for lipreading are combined with LSTM networks for audio recognition through an attention mechanism.
These networks achieve state-of-the-art phoneme recognition performance on the publicly available audio-visual dataset TCD-TIMIT. Systems that rely on audio alone degrade sharply when the audio quality is lowered by noise, as is often the case in real-life situations; this performance loss can be greatly mitigated by adding visual information.
The lipreading CNN-LSTM networks achieve 68.46% correctness, compared to a 57.85% baseline.
Audio-only networks achieve 67.03%, compared to 65.47% for the baseline.
The lipreading-audio combination networks achieve 75.70% accuracy for clean audio and 58.55% for audio at an SNR of 0 dB, while the baseline multimodal network achieves 59% and 44% for clean and noisy audio, respectively.
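To make the fusion step concrete, below is a minimal numpy sketch of attention-weighted fusion of an audio feature vector and a lipreading feature vector. This illustrates the mechanism only and is not the thesis implementation; the names (attention_fuse, w_att) and the feature dimension are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(h_audio, h_video, w_att):
    """Fuse per-modality feature vectors with scalar attention weights.

    h_audio, h_video : (d,) feature vectors, e.g. from an audio LSTM and
                       a lipreading CNN-LSTM (hypothetical names).
    w_att            : (2, d) learned scoring weights, one row per modality.
    """
    feats = np.stack([h_audio, h_video])   # (2, d)
    scores = (w_att * feats).sum(axis=1)   # one relevance score per modality
    alpha = softmax(scores)                # attention weights, sum to 1
    return alpha @ feats                   # (d,) fused representation

rng = np.random.default_rng(0)
d = 8
fused = attention_fuse(rng.normal(size=d), rng.normal(size=d),
                       rng.normal(size=(2, d)))
print(fused.shape)  # (8,)
```

The attention weights let the network lean on the visual stream when the audio is noisy, and vice versa.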


The networks are implemented using Lasagne.
There is room for improvement in the code; I'll try to improve it when I find the time.
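For reference, here is a minimal sketch of what a per-frame CNN feeding into an LSTM looks like in Lasagne. The layer sizes, frame resolution, sequence length, and number of phoneme classes are illustrative assumptions, not the configuration used in the thesis.

```python
import lasagne
from lasagne.layers import (InputLayer, ReshapeLayer, Conv2DLayer,
                            MaxPool2DLayer, DenseLayer, LSTMLayer)
from lasagne.nonlinearities import softmax

SEQ_LEN, N_PHONEMES = 20, 39  # assumed values, for illustration only

# (batch, time, channels, height, width) mouth-region frame sequences
l_in = InputLayer(shape=(None, SEQ_LEN, 1, 120, 120))
# fold time into the batch axis so the CNN processes single frames
l = ReshapeLayer(l_in, (-1, 1, 120, 120))
l = Conv2DLayer(l, num_filters=32, filter_size=(3, 3))
l = MaxPool2DLayer(l, pool_size=(2, 2))
l = DenseLayer(l, num_units=256)
# restore the time axis and model temporal context with an LSTM
l = ReshapeLayer(l, (-1, SEQ_LEN, 256))
l = LSTMLayer(l, num_units=256)
# per-frame phoneme posteriors
l = ReshapeLayer(l, (-1, 256))
l_out = DenseLayer(l, num_units=N_PHONEMES, nonlinearity=softmax)

probs = lasagne.layers.get_output(l_out)  # symbolic Theano expression
```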

For downloading, preprocessing, etc. of the dataset, see https://github.com/matthijsvk/TCDTIMITprocessing.
For the lipreading networks, see the folder code/lipreading.
For the audio speech recognition networks, see code/audioSR.
For the combination networks, see code/combinedSR.

Thanks to the authors of all the data and software used in this work, including (non-exhaustively) TCD-TIMIT, Theano, Lasagne, Anaconda, and CUDA.

To set up Python, I recommend using Anaconda. You can use the provided environment.yml to install all Python packages (e.g. `conda env create -f environment.yml`), although some of them are no longer used.
For the installation of Theano/Lasagne and CUDA, I recommend following this tutorial.

If you find this thesis or the code useful, please cite it using the following BibTeX entry:

@MastersThesis{Vankeirsbilck:Thesis:2017,
    author  = {Matthijs Van keirsbilck},
    title   = {{Design, implementation and analysis of a deep convolutional-recurrent neural network for speech recognition through audiovisual sensor fusion}},
    school  = {KU Leuven},
    address = {Belgium},
    year    = {2017},
}
