Towards Estimating the Upper Bound of Visual-Speech Recognition: The Visual Lip-Reading Feasibility Database

Author(s):  
Adriana Fernandez-Lopez ◽  
Oriol Martinez ◽  
Federico M. Sukno
2009 ◽  
pp. 388-415 ◽  
Author(s):  
Wai Chee Yau ◽  
Dinesh Kant Kumar ◽  
Hans Weghorn

The performance of a visual speech recognition technique is greatly influenced by the choice of visual speech features. Speech information in the visual domain can generally be categorized into static (mouth appearance) and motion (mouth movement) features. This chapter reviews a number of computer-based lip-reading approaches that use motion features. Motion-based visual speech recognition techniques can be broadly categorized into two types of algorithms: optical flow and image subtraction. Image subtraction techniques have been demonstrated to outperform optical-flow-based methods in lip-reading. The problem with image subtraction methods based on the difference of frames (DOF) is that such features capture changes in the images over time but do not indicate the direction of mouth movement. This chapter presents new motion features that overcome this limitation of conventional image-subtraction-based techniques in visual speech recognition. The proposed approach extracts features by applying motion segmentation to image sequences. Video data are represented in a 2-D space using grayscale images known as motion history images (MHIs). MHIs are spatio-temporal templates that implicitly encode the temporal component of mouth movement. Zernike moments are computed from the MHIs as image descriptors and classified using support vector machines (SVMs). Experimental results demonstrate that the proposed technique yields high accuracy in a phoneme classification task. The results suggest that dynamic information is important for visual speech recognition.
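The motion history image described above can be sketched in a few lines of NumPy: recently moving pixels take the maximum value and older motion decays linearly, so pixel intensity encodes how recently the mouth moved at that location. The difference threshold and decay parameter below are illustrative assumptions, not the chapter's actual settings, and the Zernike-moment/SVM stages are omitted.

```python
import numpy as np

def motion_history_image(frames, diff_thresh=30, tau=None):
    """Build a motion history image (MHI) from a grayscale frame sequence.

    Pixels that changed between the two most recent frames are set to tau;
    everywhere else the previous MHI value decays by 1 per frame, so the
    intensity implicitly encodes the recency of mouth movement.
    (diff_thresh and tau are illustrative values, not the chapter's.)
    """
    frames = [np.asarray(f, dtype=np.int16) for f in frames]
    tau = tau if tau is not None else len(frames) - 1
    mhi = np.zeros_like(frames[0], dtype=np.float64)
    for prev, cur in zip(frames, frames[1:]):
        moving = np.abs(cur - prev) > diff_thresh      # difference-of-frames mask
        mhi = np.where(moving, tau, np.maximum(mhi - 1.0, 0.0))
    return mhi / tau  # normalise to [0, 1] for use as an image descriptor
```

In a full pipeline, Zernike moments would then be computed on this normalised template and fed to the SVM classifier.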


Author(s):  
D. Ivanko ◽  
D. Ryumin

Abstract. Visual information plays a key role in automatic speech recognition (ASR) when the audio is corrupted by background noise or even inaccessible. Speech recognition using visual information is called lip-reading. The initial idea of visual speech recognition comes from human experience: we are able to recognize spoken words by observing a speaker's face with limited or no access to the sound of the voice. Based on the conducted experimental evaluations, as well as on an analysis of the research field, we propose a novel task-oriented approach towards practical lip-reading system implementation. Its main purpose is to serve as a roadmap for researchers who need to build a reliable visual speech recognition system for their task. To a rough approximation, the task of lip-reading can be divided into two parts, depending on the complexity of the problem: first, recognizing isolated words, numbers, or small phrases (e.g., telephone numbers with a strict grammar, or keywords); and second, recognizing continuous speech (phrases or sentences). All of these stages are disclosed in detail in this paper. Based on the proposed approach, we implemented from scratch automatic visual speech recognition systems with three different architectures: GMM-CHMM, DNN-HMM, and purely end-to-end. A description of the methodology, tools, step-by-step development, and all necessary parameters is disclosed in detail in the current paper. It is worth noting that such systems were created for Russian speech recognition for the first time.
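Both the GMM-CHMM and DNN-HMM architectures named above ultimately decode a word by running Viterbi search over an HMM; the sketch below shows that shared decoding step for a single word-HMM, with per-frame emission scores that could come from either a GMM or a DNN. It is a generic illustration of HMM decoding, not the paper's implementation, and the probability values in the usage are invented.

```python
import numpy as np

def viterbi_log(log_start, log_trans, log_emit):
    """Most-likely HMM state path for one observation sequence.

    log_start: (N,) log initial-state probabilities
    log_trans: (N, N) log transition probabilities, indexed [from, to]
    log_emit:  (T, N) per-frame log emission scores (e.g. from a GMM or DNN)
    Returns the best state path and its log score.
    """
    T, N = log_emit.shape
    delta = log_start + log_emit[0]          # best score ending in each state
    back = np.zeros((T, N), dtype=int)       # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # scores[from, to]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return list(reversed(path)), float(delta.max())
```

For isolated-word recognition with a strict grammar, one such decode is run per word-HMM and the word with the highest score wins; continuous recognition chains word models together with a language model.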


2021 ◽  
Author(s):  
Shashidhar R ◽  
Sudarshan Patil Kulkarni

Abstract In the current scenario, audio-visual speech recognition is one of the emerging fields of research, but there is still a deficiency of appropriate visual features for the recognition of visual speech. Human lip-readers are increasingly being presented as useful in the gathering of forensic evidence but, like all humans, suffer from unreliability in analyzing lip movement. Here we used a custom dataset and designed the system in such a way that it predicts the output for lip reading. The problem of speaker-independent lip-reading is very demanding due to unpredictable variations between people. Due to recent developments and advances in the fields of signal processing and computer vision, the task of automating lip reading is becoming a field of great interest. That is why the AVSR technique attracts great attention as a reliable solution for the speech detection problem. Here we use MFCC techniques for audio processing and an LSTM method for visual speech recognition, and finally integrate the audio and video streams using a feed-forward neural network (FFNN), achieving good accuracy. The final model was capable of making more appropriate decisions while predicting the spoken word. We were able to obtain a good accuracy of about 92.38% for the final model.
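The fusion stage described above can be sketched as a late-fusion forward pass: an audio embedding (e.g. pooled MFCCs) and a visual embedding (e.g. the LSTM's final state) are concatenated and passed through a small feed-forward network that outputs word probabilities. All dimensions and weights here are illustrative assumptions; the MFCC and LSTM front-ends themselves are omitted.

```python
import numpy as np

def fuse_av(audio_feat, visual_feat, w1, b1, w2, b2):
    """Late audio-visual fusion through a one-hidden-layer FFNN.

    audio_feat/visual_feat: 1-D embeddings from the two modality front-ends.
    w1, b1: hidden-layer weights/bias; w2, b2: output-layer weights/bias.
    Returns a probability distribution over word classes.
    """
    x = np.concatenate([audio_feat, visual_feat])  # simple feature-level join
    h = np.maximum(w1 @ x + b1, 0.0)               # ReLU hidden layer
    logits = w2 @ h + b2
    e = np.exp(logits - logits.max())              # numerically stable softmax
    return e / e.sum()
```

The design choice here is late fusion: each modality is encoded separately, so a noisy audio channel degrades only one half of the concatenated input rather than the whole front-end.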


2021 ◽  
Vol 11 (2) ◽  
pp. 6986-6992
Author(s):  
L. Poomhiran ◽  
P. Meesad ◽  
S. Nuanmeesri

This paper proposes a lip reading method based on convolutional neural networks applied to the Concatenated Three Sequence Keyframe Image (C3-SKI), consisting of (a) the Start-Lip Image (SLI), (b) the Middle-Lip Image (MLI), and (c) the End-Lip Image (ELI), which marks the end of the pronunciation of the syllable. The lip area's image dimensions were reduced to 32×32 pixels per frame, and the three keyframes concatenated together were used to represent one syllable with a dimension of 96×32 pixels for visual speech recognition. The three concatenated keyframes representing a syllable are selected based on the relative maximum and relative minimum of the open lip's width and height. The evaluation of the model's effectiveness showed accuracy, validation accuracy, loss, and validation loss values of 95.06%, 86.03%, 4.61%, and 9.04%, respectively, for the THDigits dataset. The C3-SKI technique was also applied to the AVDigits dataset, showing 85.62% accuracy. In conclusion, the C3-SKI technique could be applied to perform lip reading recognition.
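The keyframe selection and concatenation can be sketched as follows. The selection rule below (global maximum of a lip-opening signal for the middle frame, early and late minima for the start and end frames) is a simplified stand-in for the paper's relative maximum/minimum criterion, and it assumes the frames are already cropped and resized to 32×32.

```python
import numpy as np

def c3_ski(frames, lip_openings):
    """Build a Concatenated Three Sequence Keyframe Image (C3-SKI).

    frames: sequence of 32x32 grayscale lip crops for one syllable.
    lip_openings: per-frame lip-opening measurement (width/height signal).
    Picks start (SLI), middle (MLI), and end (ELI) keyframes and stacks
    them side by side into one 32x96 array (the paper's 96x32 image).
    """
    n = len(lip_openings)
    start = int(np.argmin(lip_openings[: n // 2]))               # early minimum
    end = n - 1 - int(np.argmin(lip_openings[::-1][: n // 2]))   # late minimum
    mid = int(np.argmax(lip_openings))                           # widest opening
    keys = [np.asarray(frames[i], dtype=np.uint8)[:32, :32] for i in (start, mid, end)]
    return np.hstack(keys)  # shape (32, 96): SLI | MLI | ELI
```

The resulting single image is then a fixed-size CNN input, which sidesteps variable-length sequence modelling entirely.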


Author(s):  
Guillaume Gravier ◽  
Gerasimos Potamianos ◽  
Chalapathy Neti
