Visual speech recognition by recurrent neural networks

In visual speech recognition (VSR), speech is transcribed using only visual information to interpret tongue and teeth movements. Recently, deep learning has shown outstanding performance in VSR, with accuracy exceeding that of lipreaders on benchmark datasets. However, several problems still exist when using VSR systems. A major challenge is the distinction of words with similar pronunciation, called homophones; these lead to word ambiguity. Another technical limitation of traditional VSR systems is that visual information does not provide sufficient data for learning words such as “a”, “an”, “eight”, and “bin” because their lengths are shorter than 0.02 s. This report proposes a novel lipreading architecture that combines three different convolutional neural networks (CNNs; a 3D CNN, a densely connected 3D CNN, and a multi-layer feature fusion 3D CNN), which are followed by a two-layer bi-directional gated recurrent unit. The entire network was trained using connectionist temporal classification. The results of the standard automatic speech recognition evaluation metrics show that the proposed architecture reduced the character and word error rates of the baseline model by 5.681% and 11.282%, respectively, for the unseen-speaker dataset. Our proposed architecture exhibits improved performance even when visual ambiguity arises, thereby increasing VSR reliability for practical applications.
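The character and word error rates cited above are conventionally computed as the edit (Levenshtein) distance between a reference transcript and the recognizer's hypothesis, divided by the reference length. The following is a minimal Python sketch of that convention, assuming whitespace tokenization for words; the function names and the example sentences are illustrative and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference, hypothesis):
    """CER: character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: the short word "a" is missed by the recognizer.
reference = "bin blue at a one now"
hypothesis = "bin blue at one now"
wer = word_error_rate(reference, hypothesis)       # 1 deletion / 6 words
cer = character_error_rate(reference, hypothesis)  # 2 deletions / 21 characters
```

In this example a single dropped short word costs one sixth of the word-level score but less than a tenth of the character-level score, which is one reason both metrics are usually reported together.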

A Monte Carlo EM Approach for Partially Observable Diffusion Processes: Theory and Applications to Neural Networks

Neural Computation ◽ 10.1162/08997660260028593 ◽ 2002 ◽ Vol 14 (7) ◽ pp. 1507-1544 ◽ Cited By ~ 14

Author(s): Javier R. Movellan ◽ Paul Mineiro ◽ R. J. Williams

Keyword(s): Neural Networks ◽ Monte Carlo ◽ Speech Recognition ◽ Recurrent Neural Networks ◽ Diffusion Processes ◽ Visual Speech ◽ Stochastic Version ◽ Inner Products ◽ Partially Observable ◽ Monte Carlo EM

We present a Monte Carlo approach for training partially observable diffusion processes. We apply the approach to diffusion networks, a stochastic version of continuous recurrent neural networks. The approach is aimed at learning probability distributions of continuous paths, not just expected values. Interestingly, the relevant activation statistics used by the learning rule presented here are inner products in the Hilbert space of square integrable functions. These inner products can be computed using Hebbian operations and do not require backpropagation of error signals. Moreover, standard kernel methods could potentially be applied to compute such inner products. We propose that the main reason that recurrent neural networks have not worked well in engineering applications (e.g., speech recognition) is that they implicitly rely on a very simplistic likelihood model. The diffusion network approach proposed here is much richer and may open new avenues for applications of recurrent neural networks. We present some analysis and simulations to support this view. Very encouraging results were obtained on a visual speech recognition task in which neural networks outperformed hidden Markov models.
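To make the notion of a diffusion network more concrete, the following minimal Python sketch simulates a noisy continuous-time recurrent network with the Euler-Maruyama scheme and approximates an inner product between two sampled activation paths by a Riemann sum. The drift form (leaky units with tanh activations), the parameter names, and the discretization are assumptions chosen for illustration; they are not the paper's exact model, and the sketch does not implement its Monte Carlo EM learning rule.

```python
import math
import random


def simulate_diffusion_network(x0, weights, bias, sigma, dt, steps, rng=None):
    """Euler-Maruyama simulation of dX = drift(X) dt + sigma dB.

    Assumed drift (illustrative only):
        drift_i(x) = -x_i + bias_i + sum_j weights[i][j] * tanh(x_j)
    Returns the sampled path as a list of states, including x0.
    """
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    x = list(x0)
    n = len(x)
    path = [list(x)]
    for _ in range(steps):
        act = [math.tanh(v) for v in x]
        new_x = []
        for i in range(n):
            drift = -x[i] + bias[i] + sum(weights[i][j] * act[j] for j in range(n))
            # Brownian increment: Gaussian with variance dt
            noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            new_x.append(x[i] + drift * dt + noise)
        x = new_x
        path.append(list(x))
    return path


def l2_inner_product(f, g, dt):
    """Riemann-sum approximation of the integral of f(t) * g(t) dt
    for two signals sampled on the same uniform time grid."""
    return sum(a * b for a, b in zip(f, g)) * dt


# Hypothetical two-unit network; all values are illustrative.
path = simulate_diffusion_network(
    x0=[0.5, -0.5],
    weights=[[0.0, 1.0], [-1.0, 0.0]],
    bias=[0.0, 0.0],
    sigma=0.1,
    dt=0.01,
    steps=1000,
)
unit0 = [state[0] for state in path]
unit1 = [state[1] for state in path]
correlation = l2_inner_product(unit0, unit1, dt=0.01)
```

Setting `sigma` to zero recovers a deterministic continuous recurrent network, so the noise term is what lets the model describe a distribution over paths rather than a single trajectory.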

Audio-Visual Speech Recognition using 3D Convolutional Neural Networks

10.1109/asyu52992.2021.9599016 ◽ 2021

Author(s): Ceren Belhan ◽ Damla Fikirdanis ◽ Ovgu Cimen ◽ Pelin Pasinli ◽ Zeynep Akgun ◽ ...

Keyword(s): Neural Networks ◽ Speech Recognition ◽ Convolutional Neural Networks ◽ Visual Speech ◽ Visual Speech Recognition

Improving the Recognition Performance of Lip Reading Using the Concatenated Three Sequence Keyframe Image Technique

Engineering, Technology & Applied Science Research ◽ 10.48084/etasr.4102 ◽ 2021 ◽ Vol 11 (2) ◽ pp. 6986-6992

Author(s): L. Poomhiran ◽ P. Meesad ◽ S. Nuanmeesri

Keyword(s): Neural Networks ◽ Speech Recognition ◽ Convolutional Neural Networks ◽ Recognition Performance ◽ Visual Speech ◽ Image Frame ◽ Visual Speech Recognition ◽ Lip Reading ◽ Image Technique ◽ Accuracy Validation

This paper proposes a lip reading method based on convolutional neural networks applied to the Concatenated Three Sequence Keyframe Image (C3-SKI), which consists of (a) the Start-Lip Image (SLI), (b) the Middle-Lip Image (MLI), and (c) the End-Lip Image (ELI), which marks the end of the pronunciation of a syllable. Each lip-area image was reduced to 32×32 pixels per frame, and three keyframes concatenated together were used to represent one syllable as a 96×32-pixel image for visual speech recognition. The three keyframes representing a syllable are selected based on the relative maxima and relative minima of the open lip’s width and height. On the THDigits dataset, the model achieved accuracy, validation accuracy, loss, and validation loss values of 95.06%, 86.03%, 4.61%, and 9.04%, respectively. The C3-SKI technique was also applied to the AVDigits dataset, showing 85.62% accuracy. In conclusion, the C3-SKI technique could be applied to perform lip reading recognition.
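The keyframe concatenation described in this abstract can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the three 32×32 keyframes are placed side by side (giving 32 rows of 96 pixels), represents images as nested lists, and uses a simple neighbor comparison to locate relative maxima and minima in a lip-opening signal. The function names, the side-by-side layout, and the sample values are assumptions, not details confirmed by the paper.

```python
def concat_keyframes(sli, mli, eli):
    """Place three equally sized keyframes side by side.

    Each frame is a list of rows; for 32x32 inputs the result has
    32 rows of 96 pixels (assumed horizontal layout).
    """
    height, width = len(sli), len(sli[0])
    for frame in (sli, mli, eli):
        if len(frame) != height or any(len(row) != width for row in frame):
            raise ValueError("all keyframes must share the same dimensions")
    return [sli[r] + mli[r] + eli[r] for r in range(height)]


def relative_extrema(signal):
    """Indices of strict local maxima and minima of a 1-D signal,
    e.g. a per-frame lip width or height measurement."""
    maxima, minima = [], []
    for i in range(1, len(signal) - 1):
        if signal[i] > signal[i - 1] and signal[i] > signal[i + 1]:
            maxima.append(i)
        elif signal[i] < signal[i - 1] and signal[i] < signal[i + 1]:
            minima.append(i)
    return maxima, minima


# Hypothetical data: constant-valued frames stand in for real lip crops.
start = [[0] * 32 for _ in range(32)]
middle = [[128] * 32 for _ in range(32)]
end = [[255] * 32 for _ in range(32)]
c3ski = concat_keyframes(start, middle, end)

# Hypothetical lip-height measurements over seven video frames.
lip_height = [2, 5, 9, 6, 3, 4, 1]
peaks, valleys = relative_extrema(lip_height)
```

Keeping the three frames in a single image lets an ordinary 2D CNN see the start, middle, and end of a syllable at once, at the cost of discarding the frames in between.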
