The interrelationship between the face and vocal tract configuration during audiovisual speech

2020 · Vol 117 (51) · pp. 32791-32798
Author(s): Chris Scholes, Jeremy I. Skipper, Alan Johnston

It is well established that speech perception is improved when we are able to see the speaker talking along with hearing their voice, especially when the speech is noisy. While we have a good understanding of where speech integration occurs in the brain, it is unclear how visual and auditory cues are combined to improve speech perception. One suggestion is that integration can occur as both visual and auditory cues arise from a common generator: the vocal tract. Here, we investigate whether facial and vocal tract movements are linked during speech production by comparing videos of the face and fast magnetic resonance (MR) image sequences of the vocal tract. The joint variation in the face and vocal tract was extracted using an application of principal components analysis (PCA), and we demonstrate that MR image sequences can be reconstructed with high fidelity using only the facial video and PCA. Reconstruction fidelity was significantly higher when images from the two sequences corresponded in time, and including implicit temporal information by combining contiguous frames also led to a significant increase in fidelity. A “Bubbles” technique was used to identify which areas of the face were important for recovering information about the vocal tract, and vice versa, on a frame-by-frame basis. Our data reveal that there is sufficient information in the face to recover vocal tract shape during speech. In addition, the facial and vocal tract regions that are important for reconstruction are those that are used to generate the acoustic speech signal.
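The reconstruction scheme the abstract describes, concatenating per-frame face and vocal-tract features, fitting a joint PCA, and recovering one modality from the other, can be sketched as below. This is a minimal illustration on synthetic low-rank data, not the authors' actual pipeline; all variable names and dimensions here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the real data: per-frame feature vectors for
# the face video and the MR vocal-tract sequence, generated from a
# shared latent source so the two modalities covary, as the article reports.
n_frames, d_face, d_mr, k = 200, 50, 40, 5
latent = rng.normal(size=(n_frames, k))
face = latent @ rng.normal(size=(k, d_face))
mr = latent @ rng.normal(size=(k, d_mr))

# Joint PCA: concatenate the two modalities frame by frame, centre the
# result, and keep the leading principal components of the combined data.
joint = np.hstack([face, mr])
mean = joint.mean(axis=0)
_, _, Vt = np.linalg.svd(joint - mean, full_matrices=False)
components = Vt[:k]                      # shape (k, d_face + d_mr)

# Reconstruction from the face alone: estimate each frame's component
# scores using only the face partition of the loadings, then project the
# scores back through the full joint basis to recover the MR partition.
face_part = components[:, :d_face]       # loadings on the face features
scores = (face - mean[:d_face]) @ np.linalg.pinv(face_part)
recon = scores @ components + mean
mr_recon = recon[:, d_face:]

# Relative reconstruction error of the MR frames (near zero here only
# because the synthetic data is exactly low rank).
error = np.linalg.norm(mr_recon - mr) / np.linalg.norm(mr)
```

On real video and MR data the shared variance is only partial, so reconstruction fidelity would be measured against held-out frames rather than expected to approach zero error.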

Author(s): Carol A. Fowler

The theory of speech perception as direct derives from a general direct-realist account of perception. A realist stance on perception is that perceiving enables occupants of an ecological niche to know its component layouts, objects, animals, and events. “Direct” perception means that perceivers are in unmediated contact with their niche (mediated neither by internally generated representations of the environment nor by inferences made on the basis of fragmentary input to the perceptual systems). Direct perception is possible because energy arrays that have been causally structured by niche components and that are available to perceivers specify (i.e., stand in 1:1 relation to) components of the niche. Typically, perception is multi-modal; that is, perception of the environment depends on specifying information present in, or even spanning, multiple energy arrays. Applied to speech perception, the theory begins with the observation that speech perception involves the same perceptual systems that, in a direct-realist theory, enable direct perception of the environment. Most notably, the auditory system supports speech perception, but also the visual system, and sometimes other perceptual systems. Perception of language forms (consonants, vowels, word forms) can be direct if the forms lawfully cause specifying patterning in the energy arrays available to perceivers. In Articulatory Phonology, the primitive language forms (constituting consonants and vowels) are linguistically significant gestures of the vocal tract, which cause patterning in air and on the face. Descriptions are provided of informational patterning in acoustic and other energy arrays. Evidence is next reviewed that speech perceivers make use of acoustic and cross modal information about the phonetic gestures constituting consonants and vowels to perceive the gestures. Significant problems arise for the viability of a theory of direct perception of speech. 
One is the “inverse problem,” the difficulty of recovering vocal tract shapes or actions from acoustic input. Two other problems arise because speakers coarticulate when they speak. That is, they temporally overlap production of serially nearby consonants and vowels so that there are no discrete segments in the acoustic signal corresponding to the discrete consonants and vowels that talkers intend to convey (the “segmentation problem”), and there is massive context-sensitivity in acoustic (and optical and other) patterning (the “invariance problem”). The present article suggests solutions to these problems. The article also reviews signatures of a direct mode of speech perception, including that perceivers use cross-modal speech information when it is available and exhibit various indications of perception-production linkages, such as rapid imitation and a disposition to converge in dialect with interlocutors. An underdeveloped domain within the theory concerns the very important role of longer- and shorter-term learning in speech perception. Infants develop language-specific modes of attention to acoustic speech signals (and optical information for speech), and adult listeners attune to novel dialects or foreign accents. Moreover, listeners make use of lexical knowledge and statistical properties of the language in speech perception. Some progress has been made in incorporating infant learning into a theory of direct perception of speech, but much less progress has been made in the other areas.


1984 · Vol 29 (7) · pp. 567-568
Author(s): Gilles Kirouac
Keyword(s): The Face

2012 · Vol 2012 · pp. 1-12
Author(s): Giulio Tononi, Chiara Cirelli

Sleep must serve an essential, universal function, one that offsets the risk of being disconnected from the environment. The synaptic homeostasis hypothesis (SHY) is an attempt to identify this essential function. Its core claim is that sleep is needed to reestablish synaptic homeostasis, which is challenged by the remarkable plasticity of the brain. In other words, sleep is “the price we pay for plasticity.” In this issue, M. G. Frank reviewed several aspects of the hypothesis and raised a number of issues. The comments below provide a brief summary of the motivations underlying SHY and clarify that SHY is a hypothesis not about specific mechanisms, but about a universal, essential function of sleep. This function is the preservation of synaptic homeostasis in the face of a systematic bias toward a net increase in synaptic strength—a challenge that is posed by learning during adult wake, and by massive synaptogenesis during development.


Author(s): Clairton Marcolongo Pereira, Tayná B. Silva, Laiz Zaché Roque, Bárbara Barros, Luiz Alexandre Moscon, ...
Keyword(s): The Face

2004 · Vol 47 (4) · pp. 784-801
Author(s): David J. Zajac, Mark C. Weissler

Two studies were conducted to evaluate short-latency vocal tract air pressure responses to sudden pressure bleeds during production of voiceless bilabial stop consonants. It was hypothesized that the occurrence of respiratory reflexes would be indicated by distinct patterns of responses as a function of bleed magnitude. In Study 1, 19 adults produced syllable trains of /pʌ/ using a mouthpiece coupled to a computer-controlled perturbator. The device randomly created bleed apertures that ranged from 0 to 40 mm² during production of the 2nd or 4th syllable of an utterance. Although peak oral air pressure dropped in a linear manner across bleed apertures, it averaged 2 to 3 cm H₂O at the largest bleed. While slope of oral pressure also decreased in a linear trend, duration of the oral pressure pulse remained relatively constant. The patterns suggest that respiratory reflexes, if present, have little effect on oral air pressure levels. In Study 2, both oral and subglottal air pressure responses were monitored in 2 adults while bleed apertures of 20 and 40 mm² were randomly created. For 1 participant, peak oral air pressure dropped across bleed apertures, as in Study 1. Subglottal air pressure and slope, however, remained relatively stable. These patterns provide some support for the occurrence of respiratory reflexes to regulate subglottal air pressure. Overall, the studies indicate that the inherent physiologic processes of the respiratory system, which may involve reflexes, and passive aeromechanical resistance of the upper airway are capable of developing oral air pressure in the face of substantial pressure bleeds. Implications for understanding speech production and the characteristics of individuals with velopharyngeal dysfunction are discussed.
KEY WORDS: stop consonants, oral air pressure, subglottal air pressure, respiratory reflexes, velopharyngeal dysfunction


2012 · Vol 23 (12) · pp. 1455-1460
Author(s): Lisa Legault, Timour Al-Khindi, Michael Inzlicht

Self-affirmation produces large effects: Even a simple reminder of one’s core values reduces defensiveness against threatening information. But how, exactly, does self-affirmation work? We explored this question by examining the impact of self-affirmation on neurophysiological responses to threatening events. We hypothesized that because self-affirmation increases openness to threat and enhances approachability of unfavorable feedback, it should augment attention and emotional receptivity to performance errors. We further hypothesized that this augmentation could be assessed directly, at the level of the brain. We measured self-affirmed and nonaffirmed participants’ electrophysiological responses to making errors on a task. As we anticipated, self-affirmation elicited greater error responsiveness than did nonaffirmation, as indexed by the error-related negativity, a neural signal of error monitoring. Self-affirmed participants also performed better on the task than did nonaffirmed participants. We offer novel brain evidence that self-affirmation increases openness to threat and discuss the role of error detection in the link between self-affirmation and performance.


2002 · Vol 33 (4) · pp. 237-252
Author(s): Susan Nittrouer

Phoneme-sized phonetic segments are often defined as the most basic unit of language organization. Two common inferences made from this description are that there are clear correlates to phonetic segments in the acoustic speech stream, and that humans have access to these segments from birth. In fact, well-replicated studies have shown that the acoustic signal of speech lacks invariant physical correlates to phonetic segments, and that the ability to recognize segmental structure is not present from the start of language learning. Instead, the young child must learn how to process the complex, generally continuous acoustic speech signal so that phonetic structure can be derived. This paper describes and reviews experiments that have revealed developmental changes in speech perception that accompany improvements in access to phonetic structure. In addition, this paper explains how these perceptual changes appear to be related to other aspects of language development, such as syntactic abilities and reading. Finally, evidence is provided that these critical developmental changes result from adequate language experience in naturalistic contexts; accordingly, it is suggested that intervention strategies for children with language-learning problems should focus on enhancing language experience in natural contexts.


2016 · Vol 12 (1) · pp. 20150883
Author(s): Natalia Albuquerque, Kun Guo, Anna Wilkinson, Carine Savalli, Emma Otta, ...

The perception of emotional expressions allows animals to evaluate the social intentions and motivations of each other. This usually takes place within species; however, in the case of domestic dogs, it might be advantageous to recognize the emotions of humans as well as other dogs. In this sense, the combination of visual and auditory cues to categorize others' emotions facilitates information processing and indicates high-level cognitive representations. Using a cross-modal preferential looking paradigm, we presented dogs with either human or dog faces with different emotional valences (happy/playful versus angry/aggressive) paired with a single vocalization from the same individual with either a positive or negative valence, or with Brownian noise. Dogs looked significantly longer at the face whose expression was congruent with the valence of the vocalization, for both conspecifics and heterospecifics, an ability previously known only in humans. These results demonstrate that dogs can extract and integrate bimodal sensory emotional information, and discriminate between positive and negative emotions from both humans and dogs.

