LIP-READING VIA DEEP NEURAL NETWORKS USING HYBRID VISUAL FEATURES

2018 ◽  
Vol 37 (2) ◽  
pp. 159 ◽  
Author(s):  
Fatemeh Vakhshiteh ◽  
Farshad Almasganj ◽  
Ahmad Nickabadi

Lip-reading is commonly understood as visually interpreting a speaker's lip movements during speech. Experiments over many years have revealed that speech intelligibility increases when visual facial information becomes available, and this effect is more apparent in noisy environments. Automating this process raises several challenges, such as the coarticulation phenomenon, the choice of visual units, feature diversity, and inter-speaker dependency. While efforts have been made to overcome these challenges, a flawless lip-reading system has yet to be presented. This paper searches for a lip-reading model with an efficient incorporation and arrangement of processing blocks to extract highly discriminative visual features. Here, the application of a properly structured Deep Belief Network (DBN)-based recognizer is highlighted. Multi-speaker (MS) and speaker-independent (SI) tasks are performed on the CUAVE database, and phone recognition rates (PRRs) of 77.65% and 73.40% are achieved, respectively. The best word recognition rates (WRRs) achieved in the MS and SI tasks are 80.25% and 76.91%, respectively. The resulting accuracies demonstrate that the proposed method outperforms the conventional Hidden Markov Model (HMM) and competes well with state-of-the-art visual speech recognition work.
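The abstract does not specify the DBN topology or training details, but the general recipe — stacked restricted Boltzmann machines feeding a discriminative classifier — can be sketched. The following is a minimal illustration using scikit-learn's `BernoulliRBM`; the feature dimensions, layer sizes, and five-class phone label set are all illustrative assumptions, not the paper's parameters.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for hybrid visual lip features scaled to [0, 1]; the
# paper's actual features and DBN configuration are not given here.
rng = np.random.RandomState(0)
X = rng.rand(200, 64)          # 200 frames, 64-dim visual features
y = rng.randint(0, 5, 200)     # 5 hypothetical phone classes

# A DBN is a stack of RBMs; here two RBM layers feed a softmax classifier.
dbn = Pipeline([
    ("rbm1", BernoulliRBM(n_components=32, learning_rate=0.05,
                          n_iter=10, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=16, learning_rate=0.05,
                          n_iter=10, random_state=0)),
    ("clf", LogisticRegression(max_iter=200)),
])
dbn.fit(X, y)
pred = dbn.predict(X)
print(pred.shape)  # (200,)
```

In practice the RBM layers would be pre-trained on real lip features and the whole stack fine-tuned; the pipeline above only shows the layered structure.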

Author(s):  
D. Ivanko ◽  
D. Ryumin

Abstract. Visual information plays a key role in automatic speech recognition (ASR) when audio is corrupted by background noise, or even inaccessible. Speech recognition using visual information is called lip-reading. The initial idea of visual speech recognition comes from human experience: we are able to recognize spoken words by observing a speaker's face with limited or no access to the audio signal. Based on the conducted experimental evaluations, as well as on an analysis of the research field, we propose a novel task-oriented approach to practical lip-reading system implementation. Its main purpose is to serve as a roadmap for researchers who need to build a reliable visual speech recognition system for their task. To a rough approximation, the task of lip-reading can be divided into two parts, depending on the complexity of the problem: first, recognizing isolated words, numbers, or short phrases (e.g. telephone numbers with a strict grammar, or keywords); second, recognizing continuous speech (phrases or sentences). All of these stages are discussed in detail in this paper. Based on the proposed approach, we implemented automatic visual speech recognition systems of three different architectures from scratch: GMM-CHMM, DNN-HMM, and purely end-to-end. A description of the methodology, tools, step-by-step development, and all necessary parameters is given in detail in the current paper. It is worth noting that, for Russian speech recognition, such systems have been created for the first time.
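Two of the three architectures named above (GMM-CHMM and DNN-HMM) decode with the Viterbi algorithm at recognition time. As a minimal sketch of that shared decoding step, the function below finds the most likely HMM state path; the two-state toy parameters are invented for illustration and bear no relation to the paper's models.

```python
import numpy as np

def viterbi(log_A, log_pi, log_B):
    """Most likely state path for an HMM, given the log transition
    matrix log_A (N x N), log initial probabilities log_pi (N,), and
    per-frame log emission scores log_B (T x N)."""
    T, N = log_B.shape
    delta = np.full((T, N), -np.inf)   # best path score ending in state j
    psi = np.zeros((T, N), dtype=int)  # backpointers
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # (from, to)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Two-state toy: emissions favor state 0 for two frames, then state 1.
log_A = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
log_pi = np.log(np.array([0.8, 0.2]))
log_B = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]))
print(viterbi(log_A, log_pi, log_B))  # [0, 0, 1]
```

In a GMM-HMM system the emission scores come from Gaussian mixtures; in a DNN-HMM hybrid they come from scaled neural-network posteriors, but the decoding loop is the same.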


Author(s):  
ALAA SAGHEER ◽  
NAOYUKI TSURUTA ◽  
RIN-ICHIRO TANIGUCHI ◽  
SAKASHI MAEDA

In this paper we propose a new appearance-based system consisting of two stages: visual speech feature extraction and classification, followed by recognition of the extracted features; the result is thus a complete lip-reading system. This lip-reading system employs our Hyper Column Model (HCM) approach to extract and classify the visual features, and uses the Hidden Markov Model (HMM) for recognition. This paper mainly addresses the first stage, i.e. feature extraction and classification. We investigate the performance of HCM for feature extraction and classification, and then compare it against the Fast Discrete Cosine Transform (FDCT). Unlike FDCT, HCM can extract the entire set of features without any loss. The experiments have also shown that HCM is generally better than FDCT and provides a good distribution of the phonemes in the feature space for recognition purposes. For a fair comparison, two databases are used, each at three different resolutions. One of the two databases is designed to include shifted and scaled objects. Experiments reveal that HCM is capable of handling such image variations, whereas the effectiveness of FDCT drops drastically, especially for new subjects.
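The FDCT baseline against which HCM is compared can be sketched concisely: take the 2-D DCT of the grayscale mouth region and keep only a block of low-frequency coefficients. The ROI size and the number of retained coefficients below are illustrative assumptions; the truncation step is exactly where the information loss the authors mention occurs.

```python
import numpy as np
from scipy.fft import dctn

def fdct_features(mouth_img, k=8):
    """DCT-based visual features: 2-D DCT of a grayscale mouth ROI,
    keeping only the k x k low-frequency coefficients. Discarding the
    remaining coefficients is lossy, unlike the HCM approach."""
    coeffs = dctn(mouth_img.astype(float), norm="ortho")
    return coeffs[:k, :k].ravel()

roi = np.random.RandomState(1).rand(32, 48)   # toy mouth region
feat = fdct_features(roi)
print(feat.shape)  # (64,)
```

Because the DCT is not shift- or scale-invariant, translated or rescaled mouth regions change the coefficients substantially, which is consistent with the drop the authors report on the shifted-and-scaled database.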


2009 ◽  
pp. 388-415 ◽  
Author(s):  
Wai Chee Yau ◽  
Dinesh Kant Kumar ◽  
Hans Weghorn

The performance of a visual speech recognition technique is greatly influenced by the choice of visual speech features. Speech information in the visual domain can generally be categorized into static (mouth appearance) and motion (mouth movement) features. This chapter reviews a number of computer-based lip-reading approaches using motion features. Motion-based visual speech recognition techniques fall broadly into two types of algorithms: optical flow and image subtraction. Image subtraction techniques have been demonstrated to outperform optical-flow-based methods in lip-reading. The problem with image-subtraction methods using the difference of frames (DOF) is that these features capture changes in the images over time but do not indicate the direction of mouth movement. This chapter presents new motion features that overcome this limitation of conventional image-subtraction techniques for visual speech recognition. The proposed approach extracts features by applying motion segmentation to image sequences. Video data are represented in a 2-D space using grayscale images known as motion history images (MHIs). MHIs are spatio-temporal templates that implicitly encode the temporal component of mouth movement. Zernike moments are computed from MHIs as image descriptors and classified using support vector machines (SVMs). Experimental results demonstrate that the proposed technique yields high accuracy in a phoneme classification task. The results suggest that dynamic information is important for visual speech recognition.
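The MHI construction described above can be sketched directly: each pixel that moves in the current frame is set to a maximal value, and older motion decays, so the resulting gradient of intensities encodes when (and hence in which direction) the mouth moved. The decay constant and threshold below are illustrative; the Zernike-moment and SVM stages are omitted.

```python
import numpy as np

def motion_history_image(frames, tau=10, threshold=0.1):
    """Build a motion history image from a grayscale frame sequence.
    Freshly moving pixels get value tau; earlier motion decays linearly,
    leaving a recency gradient that plain frame differencing lacks."""
    mhi = np.zeros_like(frames[0], dtype=float)
    for prev, curr in zip(frames[:-1], frames[1:]):
        moving = np.abs(curr.astype(float) - prev.astype(float)) > threshold
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
    return mhi

# Toy sequence: a bright block sliding one pixel to the right per frame.
frames = [np.zeros((8, 8)) for _ in range(4)]
for t, f in enumerate(frames):
    f[3:5, t:t + 2] = 1.0
mhi = motion_history_image(frames)
print(mhi.max(), mhi.min())  # 10.0 0.0
```

In the chapter's pipeline, Zernike moments of this template (rotation-invariant image descriptors) would then form the feature vector fed to the SVM classifier.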


2014 ◽  
Vol 1079-1080 ◽  
pp. 820-823
Author(s):  
Li Guo Zheng ◽  
Mei Li Zhu ◽  
Qing Qing Wang

This paper proposes a novel algorithm for lip feature extraction that improves the efficiency and robustness of lip-reading systems. First, the Lip Gray Energy Image (LGEI) is used to smooth noise and improve the noise resistance of the system. Second, the Discrete Wavelet Transform (DWT) is used to extract salient visual speech information from the lips by decorrelating spectral information. Last, lip features are obtained by downsampling the data from the second step; this resampling effectively reduces the amount of computation. Experimental results show that this method is highly discriminative, accurate, and computationally efficient; the precision rate can reach 96%.
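The three-step pipeline can be sketched as follows, under stated assumptions: LGEI is taken to be the per-pixel average of the grayscale lip frames (a common "energy image" construction), and a single-level Haar DWT is implemented directly in NumPy in place of a wavelet library; retaining only the half-resolution LL subband realizes the downsampling step. Frame sizes and the wavelet choice are illustrative, not the paper's parameters.

```python
import numpy as np

def lip_gray_energy_image(frames):
    """Assumed LGEI: average the grayscale lip frames, which smooths
    frame-level noise across the sequence."""
    return np.mean(np.stack(frames).astype(float), axis=0)

def haar_dwt2(img):
    """One level of a 2-D Haar DWT (hand-rolled stand-in for a wavelet
    library). Returns LL, LH, HL, HH subbands, each half the input size."""
    a = (img[0::2] + img[1::2]) / 2.0   # row-pair average
    d = (img[0::2] - img[1::2]) / 2.0   # row-pair detail
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

frames = [np.random.RandomState(s).rand(16, 16) for s in range(5)]
lgei = lip_gray_energy_image(frames)
ll, lh, hl, hh = haar_dwt2(lgei)
feat = ll.ravel()  # low-frequency subband as the downsampled feature vector
print(lgei.shape, feat.shape)  # (16, 16) (64,)
```

Keeping only the LL subband quarters the data volume per level, which is the computational saving the abstract attributes to the resampling step.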


Author(s):  
Yuanyao Lu ◽  
Jie Yan ◽  
Ke Gu

As a significant component of the Human-Computer Interface (HCI), automatic lip reading is designed to understand the content of speech by interpreting the movements of the lips. Although the performance of automatic lip reading systems is easily affected by challenging conditions such as noise, illumination, and low resolution, enormous advances in the relevant fields, accompanied by improvements in computing capability, have increased the robustness of such systems, making them more adaptable to real environments. In this paper, we survey the field and give a detailed discussion of the current state and level of development of automatic lip reading. We place particular emphasis on feature extraction and recognition model algorithms. We also compare and analyze various visual speech databases with respect to their characteristics and roles in speech recognition systems. In addition, we describe the challenges and offer our insights into future research directions for automatic lip reading.

