A speech envelope landmark for syllable encoding in human superior temporal gyrus

2018 ◽  
Author(s):  
Yulia Oganian ◽  
Edward F. Chang

Abstract Listeners use the slow amplitude modulations of speech, known as the envelope, to segment continuous speech into syllables. However, the underlying neural computations are heavily debated. We used high-density intracranial cortical recordings while participants listened to natural and synthesized control speech stimuli to determine how the envelope is represented in the human superior temporal gyrus (STG), a critical auditory brain area for speech processing. We found that the STG does not encode the instantaneous, moment-by-moment amplitude envelope of speech. Rather, a zone of the middle STG detects discrete acoustic onset edges, defined by local maxima in the rate-of-change of the envelope. Acoustic analysis demonstrated that acoustic onset edges reliably cue the information-rich transition between the consonant-onset and vowel-nucleus of syllables. Furthermore, the steepness of the acoustic edge cued whether a syllable was stressed. Synthesized amplitude-modulated tone stimuli showed that steeper edges elicited monotonically greater cortical responses, confirming the encoding of relative but not absolute amplitude. Overall, encoding of the timing and magnitude of acoustic onset edges in STG underlies our perception of the syllabic rhythm of speech.
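The landmark computation described here lends itself to a compact sketch: extract a smoothed amplitude envelope, differentiate it, and keep local maxima of its positive rate of change as acoustic onset edges. This is a minimal illustration of the idea, not the authors' analysis code; the rectify-and-average envelope, the 20 ms window, and the detection threshold are simplifying assumptions.

```python
def amplitude_envelope(signal, fs, win_ms=20):
    """Rectify and smooth with a moving average (a crude stand-in for the
    low-pass filtering typically used to extract the speech envelope)."""
    win = max(1, int(fs * win_ms / 1000))
    rect = [abs(x) for x in signal]
    env, acc = [], 0.0
    for i, x in enumerate(rect):
        acc += x
        if i >= win:
            acc -= rect[i - win]
        env.append(acc / min(i + 1, win))
    return env

def peak_rate_events(env, fs, min_rate=1e-3):
    """Local maxima in the envelope's rate of change ('acoustic onset edges').
    Returns (sample_index, steepness) pairs; steepness is the edge slope."""
    rate = [(env[i + 1] - env[i]) * fs for i in range(len(env) - 1)]
    events = []
    for i in range(1, len(rate) - 1):
        if rate[i] > min_rate and rate[i] >= rate[i - 1] and rate[i] > rate[i + 1]:
            events.append((i, rate[i]))
    return events
```

On a synthetic amplitude-modulated tone, a steeper onset ramp yields a larger steepness value at its detected edge, mirroring the stressed/unstressed cue described in the abstract.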

2019 ◽  
Vol 5 (11) ◽  
pp. eaay6279 ◽  
Author(s):  
Yulia Oganian ◽  
Edward F. Chang

The most salient acoustic features in speech are the modulations in its intensity, captured by the amplitude envelope. Perceptually, the envelope is necessary for speech comprehension. Yet, the neural computations that represent the envelope and their linguistic implications are heavily debated. We used high-density intracranial recordings, while participants listened to speech, to determine how the envelope is represented in human speech cortical areas on the superior temporal gyrus (STG). We found that a well-defined zone in middle STG detects acoustic onset edges (local maxima in the envelope rate of change). Acoustic analyses demonstrated that timing of acoustic onset edges cues syllabic nucleus onsets, while their slope cues syllabic stress. Synthesized amplitude-modulated tone stimuli showed that steeper slopes elicited greater responses, confirming cortical encoding of amplitude change, not absolute amplitude. Overall, STG encoding of the timing and magnitude of acoustic onset edges underlies the perception of speech temporal structure.


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Maya Inbar ◽  
Eitan Grossman ◽  
Ayelet N. Landau

Abstract Studies of speech processing investigate the relationship between temporal structure in speech stimuli and neural activity. Despite clear evidence that the brain tracks speech at low frequencies (~1 Hz), it is not well understood what linguistic information gives rise to this rhythm. In this study, we harness linguistic theory to draw attention to Intonation Units (IUs), a fundamental prosodic unit of human language, and characterize their temporal structure as captured in the speech envelope, an acoustic representation relevant to the neural processing of speech. IUs are defined by a specific pattern of syllable delivery, together with resets in pitch and articulatory force. Linguistic studies of spontaneous speech indicate that this prosodic segmentation paces new information in language use across diverse languages. Therefore, IUs provide a universal structural cue for the cognitive dynamics of speech production and comprehension. We study the relation between IUs and periodicities in the speech envelope, applying methods from investigations of neural synchronization. Our sample includes recordings from everyday speech contexts of over 100 speakers and six languages. We find that sequences of IUs form a consistent low-frequency rhythm and constitute a significant periodic cue within the speech envelope. Our findings allow us to predict that IUs are utilized by the neural system when tracking speech. The methods we introduce here facilitate testing this prediction in the future (i.e., with physiological data).
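The periodicity analysis can be illustrated by scanning a low-frequency grid for the dominant spectral peak in an envelope-like signal. This is a naive DTFT sketch under assumed parameters (frequency grid, mean removal), not the authors' pipeline:

```python
import math

def dft_power(x, fs, freqs):
    """Naive DTFT power at arbitrary frequencies; fine for short, offline
    signals such as an envelope sampled at a few tens of hertz."""
    n = len(x)
    mean = sum(x) / n  # remove DC so it does not dominate the low end
    out = []
    for f in freqs:
        re = sum((x[t] - mean) * math.cos(2 * math.pi * f * t / fs) for t in range(n))
        im = -sum((x[t] - mean) * math.sin(2 * math.pi * f * t / fs) for t in range(n))
        out.append((re * re + im * im) / n)
    return out

def dominant_frequency(x, fs, fmin=0.1, fmax=4.0, step=0.1):
    """Frequency (Hz) with the most power on a coarse low-frequency grid."""
    freqs = [round(fmin + k * step, 6) for k in range(int((fmax - fmin) / step) + 1)]
    p = dft_power(x, fs, freqs)
    return freqs[p.index(max(p))]
```

Applied to a smooth train of bumps spaced one second apart (a caricature of IU onsets), the dominant frequency comes out near 1 Hz, the rhythm the abstract reports.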


2001 ◽  
Vol 13 (7) ◽  
pp. 994-1005 ◽  
Author(s):  
Athena Vouloumanos ◽  
Kent A. Kiehl ◽  
Janet F. Werker ◽  
Peter F. Liddle

The detection of speech in an auditory stream is a requisite first step in processing spoken language. In this study, we used event-related fMRI to investigate the neural substrates mediating detection of speech compared with that of nonspeech auditory stimuli. Unlike previous studies addressing this issue, we contrasted speech with nonspeech analogues that were matched along key temporal and spectral dimensions. In an oddball detection task, listeners heard nonsense speech sounds, matched sine wave analogues (complex nonspeech), or single tones (simple nonspeech). Speech stimuli elicited significantly greater activation than both complex and simple nonspeech stimuli in classic receptive language areas, namely the middle temporal gyri bilaterally and in a locus lateralized to the left posterior superior temporal gyrus. In addition, speech activated a small cluster of the right inferior frontal gyrus. The activation of these areas in a simple detection task, which requires neither identification nor linguistic analysis, suggests they play a fundamental role in speech processing.


2020 ◽  
Author(s):  
Brian A. Metzger ◽  
John F. Magnotti ◽  
Zhengjia Wang ◽  
Elizabeth Nesbitt ◽  
Patrick J. Karas ◽  
...  

Abstract Experimentalists studying multisensory integration compare neural responses to multisensory stimuli with responses to the component modalities presented in isolation. This procedure is problematic for multisensory speech perception since audiovisual speech and auditory-only speech are easily intelligible but visual-only speech is not. To overcome this confound, we developed intracranial electroencephalography (iEEG) deconvolution. Individual stimuli always contained both auditory and visual speech, but jittering the onset asynchrony between modalities allowed the time course of the unisensory responses and the interaction between them to be independently estimated. We applied this procedure to electrodes implanted in human epilepsy patients (both male and female) over the posterior superior temporal gyrus (pSTG), a brain area known to be important for speech perception. iEEG deconvolution revealed sustained, positive responses to visual-only speech and larger, phasic responses to auditory-only speech. Confirming results from scalp EEG, responses to audiovisual speech were weaker than responses to auditory-only speech, demonstrating a subadditive multisensory neural computation. Leveraging the spatial resolution of iEEG, we extended these results to show that subadditivity is most pronounced in more posterior aspects of the pSTG. Across electrodes, subadditivity correlated with visual responsiveness, supporting a model in which visual speech enhances the efficiency of auditory speech processing in pSTG. The ability to separate neural processes may make iEEG deconvolution useful for studying a variety of complex cognitive and perceptual tasks.
Significance statement: Understanding speech is one of the most important human abilities. Speech perception uses information from both the auditory and visual modalities. It has been difficult to study neural responses to visual speech because visual-only speech is difficult or impossible to comprehend, unlike auditory-only and audiovisual speech. We used intracranial electroencephalography (iEEG) deconvolution to overcome this obstacle. We found that visual speech evokes a positive response in the human posterior superior temporal gyrus, enhancing the efficiency of auditory speech processing.
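The core of the deconvolution idea, jittering the audiovisual onset asynchrony so that the overlapping auditory and visual response time courses become separable by ordinary least squares, can be sketched with lagged indicator predictors. This is a toy illustration, not the authors' code; the trial timings, lag count, and response shapes used to exercise it are invented for the demonstration.

```python
def lagged_design(events, n_samples, n_lags):
    """Toeplitz-style predictor matrix: column k is the event train
    delayed by k samples."""
    X = [[0.0] * n_lags for _ in range(n_samples)]
    for t in events:
        for k in range(n_lags):
            if t + k < n_samples:
                X[t + k][k] = 1.0
    return X

def solve_normal_equations(X, y):
    """Ordinary least squares via Gaussian elimination on X'X theta = X'y."""
    n, p = len(X), len(X[0])
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    b = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    for col in range(p):  # forward elimination with partial pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    theta = [0.0] * p
    for r in range(p - 1, -1, -1):  # back substitution
        theta[r] = (b[r] - sum(A[r][c] * theta[c] for c in range(r + 1, p))) / A[r][r]
    return theta

def deconvolve(aud_onsets, vis_onsets, y, n_lags):
    """Jointly estimate the auditory and visual response time courses from a
    signal y in which both overlap; jittered asynchrony makes them separable."""
    Xa = lagged_design(aud_onsets, len(y), n_lags)
    Xv = lagged_design(vis_onsets, len(y), n_lags)
    X = [ra + rv for ra, rv in zip(Xa, Xv)]
    theta = solve_normal_equations(X, y)
    return theta[:n_lags], theta[n_lags:]
```

With varying auditory-visual lags across trials, a phasic "auditory" kernel and a sustained "visual" kernel are both recovered from their sum, which is exactly what a fixed asynchrony would not permit.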


2019 ◽  
Author(s):  
Maya Inbar ◽  
Eitan Grossman ◽  
Ayelet N. Landau

Abstract Studies of speech processing investigate the relationship between temporal structure in speech stimuli and neural activity. Despite clear evidence that the brain tracks speech at low frequencies (~1 Hz), it is not well understood what linguistic information gives rise to this rhythm. Here, we harness linguistic theory to draw attention to Intonation Units (IUs), a fundamental prosodic unit of human language, and characterize their temporal structure as captured in the speech envelope, an acoustic representation relevant to the neural processing of speech. IUs are defined by a specific pattern of syllable delivery, together with resets in pitch and articulatory force. Linguistic studies of spontaneous speech indicate that this prosodic segmentation paces new information in language use across diverse languages. Therefore, IUs provide a universal structural cue for the cognitive dynamics of speech production and comprehension. We study the relation between IUs and periodicities in the speech envelope, applying methods from investigations of neural synchronization. Our sample includes recordings from everyday speech contexts of over 100 speakers and six languages. We find that sequences of IUs form a consistent low-frequency rhythm and constitute a significant periodic cue within the speech envelope. Our findings allow us to predict that IUs are utilized by the neural system when tracking speech, and the methods we introduce facilitate testing this prediction given physiological data.


2021 ◽  
Vol 10 (14) ◽  
pp. 3078
Author(s):  
Sara Akbarzadeh ◽  
Sungmin Lee ◽  
Chin-Tuan Tan

In multi-speaker environments, cochlear implant (CI) users may attend to a target sound source in a different manner from normal hearing (NH) individuals during a conversation. This study investigated the effect of conversational sound levels on the selective auditory attention mechanisms adopted by CI and NH listeners, and how these affect their daily conversation. Nine CI users (five bilateral, three unilateral, and one bimodal) and eight NH listeners participated in this study. Behavioral speech recognition scores were collected using a matrix sentence test, and neural tracking of the speech envelope was recorded using electroencephalography (EEG). Speech stimuli were presented at three different levels (75, 65, and 55 dB SPL) in the presence of two maskers from three spatially separated speakers. Different combinations of assisted/impaired hearing modes were evaluated for CI users, and the outcomes were analyzed in three categories: electric hearing only, acoustic hearing only, and electric + acoustic hearing. Our results showed that increasing the conversational sound level degraded selective auditory attention in electric hearing. On the other hand, increasing the sound level improved selective auditory attention for the acoustic hearing group. In the NH listeners, however, increasing the sound level did not cause a significant change in auditory attention. Our results imply that the effect of sound level on selective auditory attention varies depending on the hearing mode, and that loudness control is necessary for CI users to attend to conversations with ease.
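Neural tracking of a speech envelope is commonly quantified by correlation, and selective attention can then be decoded by asking which talker's envelope correlates more strongly with the neural signal. The sketch below is a minimal illustration under that common assumption, not the study's actual analysis; the function and variable names are illustrative.

```python
def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def decode_attention(neural, env_a, env_b):
    """Label the attended talker as the speech envelope that correlates
    more strongly with the (reconstructed) neural signal."""
    ra, rb = pearson(neural, env_a), pearson(neural, env_b)
    return ('A', ra, rb) if ra > rb else ('B', ra, rb)
```

A level-dependent drop in the winning correlation margin would be one simple way to operationalize the degradation of selective attention reported for electric hearing.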


2021 ◽  
pp. 2150022
Author(s):  
Caio Cesar Enside de Abreu ◽  
Marco Aparecido Queiroz Duarte ◽  
Bruno Rodrigues de Oliveira ◽  
Jozue Vieira Filho ◽  
Francisco Villarreal

Speech processing systems are very important in different applications involving speech and voice quality, such as automatic speech recognition, forensic phonetics and speech enhancement, among others. In most of them, acoustic environmental noise is added to the original signal, decreasing the signal-to-noise ratio (SNR) and, consequently, the speech quality. Therefore, noise estimation is one of the most important steps in speech processing, whether to reduce noise before processing or to design robust algorithms. In this paper, a new approach to estimating noise from speech signals is presented and its effectiveness is tested in the speech enhancement context. For this purpose, partial least squares (PLS) regression is used to model the acoustic environment (AE), and a Wiener filter based on a priori SNR estimation is implemented to evaluate the proposed approach. Six noise types are used to create seven acoustically modeled noises. The basic idea is to use the AE model to identify the noise type and estimate its power for use in a speech processing system. Speech signals processed using the proposed method and classical noise estimators are evaluated through objective measures. Results show that the proposed method produces better speech quality than state-of-the-art noise estimators, enabling it to be used in real-time applications in the fields of robotics, telecommunications and acoustic analysis.
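A Wiener filter driven by a priori SNR estimation can be sketched per frame and frequency bin. The decision-directed rule used below (with smoothing factor alpha) is an assumed, standard choice for the a priori SNR estimator, not necessarily the paper's exact formulation; in the paper's setting the noise PSD would come from the PLS-based acoustic environment model rather than being given.

```python
def wiener_gains(noisy_power, noise_power, prev_clean_power, alpha=0.98):
    """One frame of a-priori-SNR Wiener filtering.
    noisy_power:      |Y_k|^2 per frequency bin for this frame
    noise_power:      estimated noise PSD per bin (from a noise estimator)
    prev_clean_power: |S_k|^2 estimated on the previous frame
    Returns the per-bin gains and the clean-power estimate for the next frame."""
    gains, clean_power = [], []
    for y2, n2, s2 in zip(noisy_power, noise_power, prev_clean_power):
        gamma = y2 / max(n2, 1e-12)                       # a posteriori SNR
        xi = alpha * s2 / max(n2, 1e-12) \
            + (1 - alpha) * max(gamma - 1.0, 0.0)         # decision-directed a priori SNR
        g = xi / (1.0 + xi)                               # Wiener gain
        gains.append(g)
        clean_power.append((g * g) * y2)                  # feeds the next frame's xi
    return gains, clean_power
```

Bins dominated by speech keep gains near 1 while noise-only bins are attenuated toward 0, which is how a better noise-power estimate (here, from the AE model) translates directly into better enhanced speech.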


2018 ◽  
Author(s):  
Arafat Angulo-Perkins ◽  
Luis Concha

ABSTRACT Musicality refers to specific biological traits that allow us to perceive, generate and enjoy music. These abilities can be studied at different organizational levels (e.g., behavioural, physiological, evolutionary), and all of them reflect that music and speech processing are two different cognitive domains. Previous research has shown evidence of this functional divergence in auditory cortical regions in the superior temporal gyrus (such as the planum polare), showing increased activity upon listening to music, as compared to other complex acoustic signals. Here, we examine brain activity underlying vocal music and speech perception, while we compare musicians and non-musicians. We designed a stimulation paradigm using the same voice to produce spoken sentences, hummed melodies, and sung sentences; the same sentences were used in the speech and song categories, and the same melodies were used in the musical categories (song and hum). Participants listened to this paradigm while we acquired functional magnetic resonance images (fMRI). Different analyses demonstrated greater involvement of specific auditory and motor regions during music perception, as compared to speech vocalizations. This music-sensitive network includes bilateral activation of the planum polare and temporale, as well as a group of regions lateralized to the right hemisphere that includes the supplementary motor area, premotor cortex and the inferior frontal gyrus. Our results show that the simple act of listening to music generates stronger activation of motor regions, possibly preparing us to move following the beat. Vocal music listening, with and without lyrics, is also accompanied by a higher modulation of specific secondary auditory cortices such as the planum polare, confirming its crucial role in music processing independently of previous musical training. This study provides further evidence that music perception enhances audio-sensorimotor activity, which is crucial for clinical approaches exploring music-based therapies to improve communicative and motor skills.


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Raphaël Thézé ◽  
Mehdi Ali Gadiri ◽  
Louis Albert ◽  
Antoine Provost ◽  
Anne-Lise Giraud ◽  
...  

Abstract Natural speech is processed in the brain as a mixture of auditory and visual features. An example of the importance of visual speech is the McGurk effect and related perceptual illusions that result from mismatching auditory and visual syllables. Although the McGurk effect has widely been applied to the exploration of audio-visual speech processing, it relies on isolated syllables, which severely limits the conclusions that can be drawn from the paradigm. In addition, the extreme variability and the quality of the stimuli usually employed prevent comparability across studies. To overcome these limitations, we present an innovative methodology using 3D virtual characters with realistic lip movements synchronized on computer-synthesized speech. We used commercially accessible and affordable tools to facilitate reproducibility and comparability, and the set-up was validated on 24 participants performing a perception task. Within complete and meaningful French sentences, we paired a labiodental fricative viseme (i.e. /v/) with a bilabial occlusive phoneme (i.e. /b/). This audiovisual mismatch is known to induce the illusion of hearing /v/ in a proportion of trials. We tested the rate of the illusion while varying the magnitude of background noise and audiovisual lag. Overall, the effect was observed in 40% of trials. The proportion rose to about 50% with added background noise and up to 66% when controlling for phonetic features. Our results conclusively demonstrate that computer-generated speech stimuli are a judicious choice, and that they can supplement natural speech with higher control over stimulus timing and content.


2018 ◽  
Vol 8 (4) ◽  
pp. 115
Author(s):  
Abdul Malik Abbasi ◽  
Mansoor Ahmed Channa ◽  
Stephen John ◽  
Masood Akhter Memon ◽  
Rabia Anwar

Acoustic analysis tests the hypothesis that the physical properties of Pakistani English (PaKE) vowels differ from those of native American English speakers in their acoustic measurements. The present paper aims to document the physical behavior of English vowels produced by PaKE learners, measuring the frequencies and durations of the vowels involved in their articulation. The English vowels selected for this purpose are /æ/, /ɛ/, /ɪ/, /ɒ/ and /ə/. Samples were obtained from ten speakers at the department of computer science at Sindh Madressatul Islam University, Karachi. The study was based on the analysis of 500 (10 × 5 × 10 = 500) voice samples. Five vowel minimal pairs were selected and written in a carrier phrase [I say CVC now]. The ten speakers (five male and five female) recorded their 500 voice samples using the Praat speech processing tool and a high-quality microphone on a laptop in a computer laboratory with no background noise. Three parameters were considered for the analysis of the PaKE vowels: the duration of the five vowels and the first two formant frequencies (F1 and F2). It was hypothesized that the properties of PaKE vowels differ from those of native English speakers. The hypothesis was supported, since the acoustic measurements of the PaKE speakers' sounds were found to differ from those of native American English speakers.

