The effect of utterance duration on visual‐speech intelligibility scores

Jean‐Pierre Gagne; Lina Boutin

doi:10.1121/1.417402

LIP-READING VIA DEEP NEURAL NETWORKS USING HYBRID VISUAL FEATURES

Image Analysis & Stereology ◽

10.5566/ias.1859 ◽

2018 ◽

Vol 37 (2) ◽

pp. 159 ◽

Cited By ~ 2

Author(s):

Fatemeh Vakhshiteh ◽

Farshad Almasganj ◽

Ahmad Nickabadi

Keyword(s):

Speech Intelligibility ◽

Deep Neural Networks ◽

Visual Speech ◽

Visual Features ◽

Noisy Environments ◽

Phone Recognition ◽

Facial Information ◽

Visual Speech Recognition ◽

Lip Reading ◽

Reading System

Lip-reading is typically known as visually interpreting the speaker's lip movements during speaking. Experiments over many years have revealed that speech intelligibility increases if visual facial information becomes available. This effect becomes more apparent in noisy environments. Taking steps toward automating this process, some challenges will be raised such as coarticulation phenomenon, visual units' type, features diversity and their inter-speaker dependency. While efforts have been made to overcome these challenges, presentation of a flawless lip-reading system is still under the investigations. This paper searches for a lipreading model with an efficiently developed incorporation and arrangement of processing blocks to extract highly discriminative visual features. Here, application of a properly structured Deep Belief Network (DBN)- based recognizer is highlighted. Multi-speaker (MS) and speaker-independent (SI) tasks are performed over CUAVE database, and phone recognition rates (PRRs) of 77.65% and 73.40% are achieved, respectively. The best word recognition rates (WRRs) achieved in the tasks of MS and SI are 80.25% and 76.91%, respectively. Resulted accuracies demonstrate that the proposed method outperforms the conventional Hidden Markov Model (HMM) and competes well with the state-of-the-art visual speech recognition works.

Download Full-text

Toward a Model of Auditory-Visual Speech Intelligibility

Multisensory Processes - Springer Handbook of Auditory Research ◽

10.1007/978-3-030-10461-0_3 ◽

2019 ◽

pp. 33-57 ◽

Cited By ~ 3

Author(s):

Ken W. Grant ◽

Joshua G. W. Bernstein

Keyword(s):

Speech Intelligibility ◽

Visual Speech

Download Full-text

Lip movements entrain the observers’ low-frequency brain oscillations to facilitate speech intelligibility

eLife ◽

10.7554/elife.14521 ◽

2016 ◽

Vol 5 ◽

Cited By ~ 65

Author(s):

Hyojin Park ◽

Christoph Kayser ◽

Gregor Thut ◽

Joachim Gross

Keyword(s):

Visual Cortex ◽

Speech Processing ◽

Speech Intelligibility ◽

Brain Activity ◽

Low Frequency ◽

Visual Speech ◽

Visual Signals ◽

Partial Coherence ◽

Auditory Speech ◽

Oscillatory Brain Activity

During continuous speech, lip movements provide visual temporal signals that facilitate speech processing. Here, using MEG we directly investigated how these visual signals interact with rhythmic brain activity in participants listening to and seeing the speaker. First, we investigated coherence between oscillatory brain activity and speaker’s lip movements and demonstrated significant entrainment in visual cortex. We then used partial coherence to remove contributions of the coherent auditory speech signal from the lip-brain coherence. Comparing this synchronization between different attention conditions revealed that attending visual speech enhances the coherence between activity in visual cortex and the speaker’s lips. Further, we identified a significant partial coherence between left motor cortex and lip movements and this partial coherence directly predicted comprehension accuracy. Our results emphasize the importance of visually entrained and attention-modulated rhythmic brain activity for the enhancement of audiovisual speech processing.

Download Full-text

Contributions of local speech encoding and functional connectivity to audio-visual speech perception

eLife ◽

10.7554/elife.24763 ◽

2017 ◽

Vol 6 ◽

Cited By ~ 36

Author(s):

Bruno L Giordano ◽

Robin A A Ince ◽

Joachim Gross ◽

Philippe G Schyns ◽

Stefano Panzeri ◽

...

Keyword(s):

Functional Connectivity ◽

Frontal Cortex ◽

Speech Intelligibility ◽

Brain Activity ◽

Visual Speech ◽

Speech Comprehension ◽

Low Snr ◽

Speech Encoding ◽

Inferior Frontal Cortex ◽

Underlying Network

Seeing a speaker’s face enhances speech intelligibility in adverse environments. We investigated the underlying network mechanisms by quantifying local speech representations and directed connectivity in MEG data obtained while human participants listened to speech of varying acoustic SNR and visual context. During high acoustic SNR speech encoding by temporally entrained brain activity was strong in temporal and inferior frontal cortex, while during low SNR strong entrainment emerged in premotor and superior frontal cortex. These changes in local encoding were accompanied by changes in directed connectivity along the ventral stream and the auditory-premotor axis. Importantly, the behavioral benefit arising from seeing the speaker’s face was not predicted by changes in local encoding but rather by enhanced functional connectivity between temporal and inferior frontal cortex. Our results demonstrate a role of auditory-frontal interactions in visual speech representations and suggest that functional connectivity along the ventral pathway facilitates speech comprehension in multisensory environments.

Download Full-text

Audio-Visual Speech Intelligibility Benefits with Bilateral Cochlear Implants when Talker Location Varies

Journal of the Association for Research in Otolaryngology ◽

10.1007/s10162-014-0503-7 ◽

2015 ◽

Vol 16 (2) ◽

pp. 309-315 ◽

Cited By ~ 10

Author(s):

Richard J. M. van Hoesel

Keyword(s):

Cochlear Implants ◽

Speech Intelligibility ◽

Visual Speech ◽

Bilateral Cochlear Implants

Download Full-text

Contribution of Oral Periphery on Visual Speech Intelligibility

Advances in Computing and Communications - Communications in Computer and Information Science ◽

10.1007/978-3-642-22714-1_20 ◽

2011 ◽

pp. 183-190

Author(s):

Preety Singh ◽

Deepika Gupta ◽

V. Laxmi ◽

M. S. Gaur

Keyword(s):

Speech Intelligibility ◽

Visual Speech

Download Full-text

Visual Speech Benefit in Clear and Degraded Speech Depends on the Auditory Intelligibility of the Talker and the Number of Background Talkers

Trends in Hearing ◽

10.1177/2331216519837866 ◽

2019 ◽

Vol 23 ◽

pp. 233121651983786 ◽

Cited By ~ 2

Author(s):

Catherine L. Blackburn ◽

Pádraig T. Kitterick ◽

Gary Jones ◽

Christian J. Sumner ◽

Paula C. Stacey

Keyword(s):

Speech Intelligibility ◽

Noise Signal ◽

Sine Wave ◽

Theory Model ◽

Visual Speech ◽

Clear Speech ◽

Visual Integration ◽

Degraded Speech ◽

The Face ◽

Vocoded Speech

Perceiving speech in background noise presents a significant challenge to listeners. Intelligibility can be improved by seeing the face of a talker. This is of particular value to hearing impaired people and users of cochlear implants. It is well known that auditory-only speech understanding depends on factors beyond audibility. How these factors impact on the audio-visual integration of speech is poorly understood. We investigated audio-visual integration when either the interfering background speech (Experiment 1) or intelligibility of the target talkers (Experiment 2) was manipulated. Clear speech was also contrasted with sine-wave vocoded speech to mimic the loss of temporal fine structure with a cochlear implant. Experiment 1 showed that for clear speech, the visual speech benefit was unaffected by the number of background talkers. For vocoded speech, a larger benefit was found when there was only one background talker. Experiment 2 showed that visual speech benefit depended upon the audio intelligibility of the talker and increased as intelligibility decreased. Degrading the speech by vocoding resulted in even greater benefit from visual speech information. A single “independent noise” signal detection theory model predicted the overall visual speech benefit in some conditions but could not predict the different levels of benefit across variations in the background or target talkers. This suggests that, similar to audio-only speech intelligibility, the integration of audio-visual speech cues may be functionally dependent on factors other than audibility and task difficulty, and that clinicians and researchers should carefully consider the characteristics of their stimuli when assessing audio-visual integration.

Download Full-text

Spatial alignment between faces and voices improves selective attention to audio-visual speech

10.1101/2021.04.19.440487 ◽

2021 ◽

Author(s):

Justin T Fleming ◽

Ross K. Maddox ◽

Barbara G Shinn-Cunningham

Keyword(s):

Task Performance ◽

Speech Intelligibility ◽

Visual Speech ◽

Eye Position ◽

Noise Levels ◽

Floor Effect ◽

Closed Set ◽

High Noise ◽

Spatial Alignment ◽

The Cost

The ability to see a talker's face has long been known to improve speech intelligibility in noise. This perceptual benefit depends on approximate temporal alignment between the auditory and visual speech components. However, the practical role that cross-modal spatial alignment plays in integrating audio-visual (AV) speech remains unresolved, particularly when competing talkers are present. In a series of online experiments, we investigated the importance of spatial alignment between corresponding faces and voices using a paradigm that featured both acoustic masking (speech-shaped noise) and attentional demands from a competing talker. Participants selectively attended a Target Talker's speech, then identified a word spoken by the Target Talker. In Exp. 1, we found improved task performance when the talkers' faces were visible, but only when corresponding faces and voices were presented in the same hemifield (spatially aligned). In Exp. 2, we tested for possible influences of eye position on this result. In auditory-only conditions, directing gaze toward the distractor voice reduced performance as predicted, but this effect could not fully explain the cost of AV spatial misalignment. Finally, in Exp. 3 and 4, we show that the effect of AV spatial alignment changes with noise level, but this was limited by a floor effect: due to the use of closed-set stimuli, participants were able to perform the task relatively well using lipreading alone. However, comparison between the results of Exp. 1 and Exp. 3 suggests that the cost of AV misalignment is larger at high noise levels. Overall, these results indicate that spatial alignment between corresponding faces and voices is important for AV speech integration in attentionally demanding communication settings.

Download Full-text

Visual‐speech intelligibility for syllables: A comparison of conversational and clear speech

The Journal of the Acoustical Society of America ◽

10.1121/1.413905 ◽

1995 ◽

Vol 98 (5) ◽

pp. 2983-2983

Author(s):

Jean‐Pierre Gagné ◽

Anne‐Josée Rochette

Keyword(s):

Speech Intelligibility ◽

Visual Speech ◽

Clear Speech

Download Full-text

Audiovisual Speech Processing in Relationship to Phonological and Vocabulary Skills in First Graders

Journal of Speech Language and Hearing Research ◽

10.1044/2021_jslhr-21-00196 ◽

2021 ◽

pp. 1-19

Author(s):

Liesbeth Gijbels ◽

Jason D. Yeatman ◽

Kaylah Lalonde ◽

Adrian K. C. Lee

Keyword(s):

Individual Differences ◽

Word Recognition ◽

Phonological Awareness ◽

Speech Intelligibility ◽

Visual Cues ◽

First Graders ◽

First Grade ◽

Vocabulary Knowledge ◽

Visual Speech ◽

The Impact

Purpose It is generally accepted that adults use visual cues to improve speech intelligibility in noisy environments, but findings regarding visual speech benefit in children are mixed. We explored factors that contribute to audiovisual (AV) gain in young children's speech understanding. We examined whether there is an AV benefit to speech-in-noise recognition in children in first grade and if visual salience of phonemes influences their AV benefit. We explored if individual differences in AV speech enhancement could be explained by vocabulary knowledge, phonological awareness, or general psychophysical testing performance. Method Thirty-seven first graders completed online psychophysical experiments. We used an online single-interval, four-alternative forced-choice picture-pointing task with age-appropriate consonant–vowel–consonant words to measure auditory-only, visual-only, and AV word recognition in noise at −2 and −8 dB SNR. We obtained standard measures of vocabulary and phonological awareness and included a general psychophysical test to examine correlations with AV benefits. Results We observed a significant overall AV gain among children in first grade. This effect was mainly attributed to the benefit at −8 dB SNR, for visually distinct targets. Individual differences were not explained by any of the child variables. Boys showed lower auditory-only performances, leading to significantly larger AV gains. Conclusions This study shows AV benefit, of distinctive visual cues, to word recognition in challenging noisy conditions in first graders. The cognitive and linguistic constraints of the task may have minimized the impact of individual differences of vocabulary and phonological awareness on AV benefit. The gender difference should be studied on a larger sample and age range.

Download Full-text