Towards Estimating the Upper Bound of Visual-Speech Recognition: The Visual Lip-Reading Feasibility Database

Author(s):  
Adriana Fernandez-Lopez ◽  
Oriol Martinez ◽  
Federico M. Sukno
2009 ◽  
pp. 388-415 ◽  
Author(s):  
Wai Chee Yau ◽  
Dinesh Kant Kumar ◽  
Hans Weghorn

The performance of a visual speech recognition technique is greatly influenced by the choice of visual speech features. Speech information in the visual domain can generally be categorized into static (mouth appearance) and motion (mouth movement) features. This chapter reviews a number of computer-based lip-reading approaches that use motion features. Motion-based visual speech recognition techniques can be broadly categorized into two types of algorithms: optical flow and image subtraction. Image subtraction techniques have been demonstrated to outperform optical-flow-based methods in lip-reading. The problem with image subtraction methods based on the difference of frames (DOF) is that such features capture changes in the images over time but do not indicate the direction of mouth movement. This chapter presents new motion features that overcome this limitation of conventional image-subtraction-based techniques in visual speech recognition. The proposed approach extracts features by applying motion segmentation to image sequences. Video data are represented in a 2-D space using grayscale images known as motion history images (MHIs). MHIs are spatio-temporal templates that implicitly encode the temporal component of mouth movement. Zernike moments are computed from the MHIs as image descriptors and classified using support vector machines (SVMs). Experimental results demonstrate that the proposed technique yields high accuracy in a phoneme classification task. The results suggest that dynamic information is important for visual speech recognition.
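The motion history image described above can be sketched in a few lines of NumPy: recently moving pixels take the maximum value and older motion decays linearly, so pixel intensity encodes how recently the mouth moved at that location. The difference threshold and decay parameter below are illustrative assumptions, not the chapter's actual settings, and the Zernike-moment/SVM stages are omitted.

```python
import numpy as np

def motion_history_image(frames, diff_thresh=30, tau=None):
    """Build a motion history image (MHI) from a grayscale frame sequence.

    Pixels that changed between the two most recent frames are set to tau;
    everywhere else the previous MHI value decays by 1 per frame, so the
    intensity implicitly encodes the recency of mouth movement.
    (diff_thresh and tau are illustrative values, not the chapter's.)
    """
    frames = [np.asarray(f, dtype=np.int16) for f in frames]
    tau = tau if tau is not None else len(frames) - 1
    mhi = np.zeros_like(frames[0], dtype=np.float64)
    for prev, cur in zip(frames, frames[1:]):
        moving = np.abs(cur - prev) > diff_thresh      # difference-of-frames mask
        mhi = np.where(moving, tau, np.maximum(mhi - 1.0, 0.0))
    return mhi / tau  # normalise to [0, 1] for use as an image descriptor
```

In a full pipeline, Zernike moments would then be computed on this normalised template and fed to the SVM classifier.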


Author(s):  
D. Ivanko ◽  
D. Ryumin

Abstract. Visual information plays a key role in automatic speech recognition (ASR) when the audio is corrupted by background noise or even inaccessible. Speech recognition using visual information is called lip-reading. The initial idea of visual speech recognition comes from human experience: we are able to recognize spoken words by observing a speaker's face with limited or no access to the sound of the voice. Based on the conducted experimental evaluations, as well as on an analysis of the research field, we propose a novel task-oriented approach towards practical lip-reading system implementation. Its main purpose is to serve as a roadmap for researchers who need to build a reliable visual speech recognition system for their task. To a rough approximation, the task of lip-reading can be divided into two parts, depending on the complexity of the problem: first, recognizing isolated words, numbers, or small phrases (e.g., telephone numbers with a strict grammar, or keywords); and second, recognizing continuous speech (phrases or sentences). All of these stages are disclosed in detail in this paper. Based on the proposed approach, we implemented from scratch automatic visual speech recognition systems with three different architectures: GMM-CHMM, DNN-HMM, and purely end-to-end. A description of the methodology, tools, step-by-step development, and all necessary parameters is disclosed in detail in the current paper. It is worth noting that such systems were created for Russian speech recognition for the first time.
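Both the GMM-CHMM and DNN-HMM architectures named above ultimately decode a word by running Viterbi search over an HMM; the sketch below shows that shared decoding step for a single word-HMM, with per-frame emission scores that could come from either a GMM or a DNN. It is a generic illustration of HMM decoding, not the paper's implementation, and the probability values in the usage are invented.

```python
import numpy as np

def viterbi_log(log_start, log_trans, log_emit):
    """Most-likely HMM state path for one observation sequence.

    log_start: (N,) log initial-state probabilities
    log_trans: (N, N) log transition probabilities, indexed [from, to]
    log_emit:  (T, N) per-frame log emission scores (e.g. from a GMM or DNN)
    Returns the best state path and its log score.
    """
    T, N = log_emit.shape
    delta = log_start + log_emit[0]          # best score ending in each state
    back = np.zeros((T, N), dtype=int)       # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # scores[from, to]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return list(reversed(path)), float(delta.max())
```

For isolated-word recognition with a strict grammar, one such decode is run per word-HMM and the word with the highest score wins; continuous recognition chains word models together with a language model.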


2021 ◽  
Author(s):  
Shashidhar R ◽  
Sudarshan Patil Kulkarni

Abstract In the current scenario, audio-visual speech recognition is one of the emerging fields of research, but there is still a deficiency of appropriate visual features for the recognition of visual speech. Human lip-readers are increasingly being presented as useful in the gathering of forensic evidence but, like all humans, suffer from unreliability in analyzing lip movement. Here we used a custom dataset and designed the system in such a way that it predicts the output for lip reading. The problem of speaker-independent lip-reading is very demanding due to unpredictable variations between people. Due to recent developments and advances in the fields of signal processing and computer vision, the task of automating lip reading is becoming a field of great interest. That is why the AVSR technique attracts great attention as a reliable solution for the speech detection problem. Here we use MFCC techniques for audio processing and an LSTM method for visual speech recognition, and finally integrate the audio and video streams using a feed-forward neural network (FFNN), achieving good accuracy. The final model was capable of making more appropriate decisions while predicting the spoken word. We were able to obtain a good accuracy of about 92.38% for the final model.
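The fusion stage described above can be sketched as a late-fusion forward pass: an audio embedding (e.g. pooled MFCCs) and a visual embedding (e.g. the LSTM's final state) are concatenated and passed through a small feed-forward network that outputs word probabilities. All dimensions and weights here are illustrative assumptions; the MFCC and LSTM front-ends themselves are omitted.

```python
import numpy as np

def fuse_av(audio_feat, visual_feat, w1, b1, w2, b2):
    """Late audio-visual fusion through a one-hidden-layer FFNN.

    audio_feat/visual_feat: 1-D embeddings from the two modality front-ends.
    w1, b1: hidden-layer weights/bias; w2, b2: output-layer weights/bias.
    Returns a probability distribution over word classes.
    """
    x = np.concatenate([audio_feat, visual_feat])  # simple feature-level join
    h = np.maximum(w1 @ x + b1, 0.0)               # ReLU hidden layer
    logits = w2 @ h + b2
    e = np.exp(logits - logits.max())              # numerically stable softmax
    return e / e.sum()
```

The design choice here is late fusion: each modality is encoded separately, so a noisy audio channel degrades only one half of the concatenated input rather than the whole front-end.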


2021 ◽  
Vol 11 (2) ◽  
pp. 6986-6992
Author(s):  
L. Poomhiran ◽  
P. Meesad ◽  
S. Nuanmeesri

This paper proposes a lip reading method based on convolutional neural networks applied to the Concatenated Three Sequence Keyframe Image (C3-SKI), consisting of (a) the Start-Lip Image (SLI), (b) the Middle-Lip Image (MLI), and (c) the End-Lip Image (ELI), which marks the end of the pronunciation of the syllable. The lip area's image dimensions were reduced to 32×32 pixels per frame, and the three keyframes concatenated together were used to represent one syllable with a dimension of 96×32 pixels for visual speech recognition. The three concatenated keyframes representing a syllable are selected based on the relative maximum and relative minimum of the open lip's width and height. The evaluation of the model's effectiveness showed accuracy, validation accuracy, loss, and validation loss values of 95.06%, 86.03%, 4.61%, and 9.04%, respectively, for the THDigits dataset. The C3-SKI technique was also applied to the AVDigits dataset, showing 85.62% accuracy. In conclusion, the C3-SKI technique could be applied to perform lip reading recognition.
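The keyframe selection and concatenation can be sketched as follows. The selection rule below (global maximum of a lip-opening signal for the middle frame, early and late minima for the start and end frames) is a simplified stand-in for the paper's relative maximum/minimum criterion, and it assumes the frames are already cropped and resized to 32×32.

```python
import numpy as np

def c3_ski(frames, lip_openings):
    """Build a Concatenated Three Sequence Keyframe Image (C3-SKI).

    frames: sequence of 32x32 grayscale lip crops for one syllable.
    lip_openings: per-frame lip-opening measurement (width/height signal).
    Picks start (SLI), middle (MLI), and end (ELI) keyframes and stacks
    them side by side into one 32x96 array (the paper's 96x32 image).
    """
    n = len(lip_openings)
    start = int(np.argmin(lip_openings[: n // 2]))               # early minimum
    end = n - 1 - int(np.argmin(lip_openings[::-1][: n // 2]))   # late minimum
    mid = int(np.argmax(lip_openings))                           # widest opening
    keys = [np.asarray(frames[i], dtype=np.uint8)[:32, :32] for i in (start, mid, end)]
    return np.hstack(keys)  # shape (32, 96): SLI | MLI | ELI
```

The resulting single image is then a fixed-size CNN input, which sidesteps variable-length sequence modelling entirely.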


Author(s):  
Guillaume Gravier ◽  
Gerasimos Potamianos ◽  
Chalapathy Neti
