A speech envelope landmark for syllable encoding in human superior temporal gyrus

2018 ◽  
Author(s):  
Yulia Oganian ◽  
Edward F. Chang

Abstract Listeners use the slow amplitude modulations of speech, known as the envelope, to segment continuous speech into syllables. However, the underlying neural computations are heavily debated. We used high-density intracranial cortical recordings while participants listened to natural and synthesized control speech stimuli to determine how the envelope is represented in the human superior temporal gyrus (STG), a critical auditory brain area for speech processing. We found that the STG does not encode the instantaneous, moment-by-moment amplitude envelope of speech. Rather, a zone of the middle STG detects discrete acoustic onset edges, defined by local maxima in the rate-of-change of the envelope. Acoustic analysis demonstrated that acoustic onset edges reliably cue the information-rich transition between the consonant-onset and vowel-nucleus of syllables. Furthermore, the steepness of the acoustic edge cued whether a syllable was stressed. Synthesized amplitude-modulated tone stimuli showed that steeper edges elicited monotonically greater cortical responses, confirming the encoding of relative but not absolute amplitude. Overall, encoding of the timing and magnitude of acoustic onset edges in STG underlies our perception of the syllabic rhythm of speech.
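The landmark computation described here lends itself to a compact sketch: extract a smoothed amplitude envelope, differentiate it, and keep local maxima of its positive rate of change as acoustic onset edges. This is a minimal illustration of the idea, not the authors' analysis code; the rectify-and-average envelope, the 20 ms window, and the detection threshold are simplifying assumptions.

```python
def amplitude_envelope(signal, fs, win_ms=20):
    """Rectify and smooth with a moving average (a crude stand-in for the
    low-pass filtering typically used to extract the speech envelope)."""
    win = max(1, int(fs * win_ms / 1000))
    rect = [abs(x) for x in signal]
    env, acc = [], 0.0
    for i, x in enumerate(rect):
        acc += x
        if i >= win:
            acc -= rect[i - win]
        env.append(acc / min(i + 1, win))
    return env

def peak_rate_events(env, fs, min_rate=1e-3):
    """Local maxima in the envelope's rate of change ('acoustic onset edges').
    Returns (sample_index, steepness) pairs; steepness is the edge slope."""
    rate = [(env[i + 1] - env[i]) * fs for i in range(len(env) - 1)]
    events = []
    for i in range(1, len(rate) - 1):
        if rate[i] > min_rate and rate[i] >= rate[i - 1] and rate[i] > rate[i + 1]:
            events.append((i, rate[i]))
    return events
```

On a synthetic amplitude-modulated tone, a steeper onset ramp yields a larger steepness value at its detected edge, mirroring the stressed/unstressed cue described in the abstract.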

2019 ◽  
Vol 5 (11) ◽  
pp. eaay6279 ◽  
Author(s):  
Yulia Oganian ◽  
Edward F. Chang

The most salient acoustic features in speech are the modulations in its intensity, captured by the amplitude envelope. Perceptually, the envelope is necessary for speech comprehension. Yet, the neural computations that represent the envelope and their linguistic implications are heavily debated. We used high-density intracranial recordings, while participants listened to speech, to determine how the envelope is represented in human speech cortical areas on the superior temporal gyrus (STG). We found that a well-defined zone in middle STG detects acoustic onset edges (local maxima in the envelope rate of change). Acoustic analyses demonstrated that timing of acoustic onset edges cues syllabic nucleus onsets, while their slope cues syllabic stress. Synthesized amplitude-modulated tone stimuli showed that steeper slopes elicited greater responses, confirming cortical encoding of amplitude change, not absolute amplitude. Overall, STG encoding of the timing and magnitude of acoustic onset edges underlies the perception of speech temporal structure.


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Maya Inbar ◽  
Eitan Grossman ◽  
Ayelet N. Landau

Abstract Studies of speech processing investigate the relationship between temporal structure in speech stimuli and neural activity. Despite clear evidence that the brain tracks speech at low frequencies (~1 Hz), it is not well understood what linguistic information gives rise to this rhythm. In this study, we harness linguistic theory to draw attention to Intonation Units (IUs), a fundamental prosodic unit of human language, and characterize their temporal structure as captured in the speech envelope, an acoustic representation relevant to the neural processing of speech. IUs are defined by a specific pattern of syllable delivery, together with resets in pitch and articulatory force. Linguistic studies of spontaneous speech indicate that this prosodic segmentation paces new information in language use across diverse languages. Therefore, IUs provide a universal structural cue for the cognitive dynamics of speech production and comprehension. We study the relation between IUs and periodicities in the speech envelope, applying methods from investigations of neural synchronization. Our sample includes recordings from everyday speech contexts of over 100 speakers and six languages. We find that sequences of IUs form a consistent low-frequency rhythm and constitute a significant periodic cue within the speech envelope. Our findings allow us to predict that IUs are utilized by the neural system when tracking speech. The methods we introduce here facilitate testing this prediction in the future (i.e., with physiological data).
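The periodicity analysis can be illustrated by scanning a low-frequency grid for the dominant spectral peak in an envelope-like signal. This is a naive DTFT sketch under assumed parameters (frequency grid, mean removal), not the authors' pipeline:

```python
import math

def dft_power(x, fs, freqs):
    """Naive DTFT power at arbitrary frequencies; fine for short, offline
    signals such as an envelope sampled at a few tens of hertz."""
    n = len(x)
    mean = sum(x) / n  # remove DC so it does not dominate the low end
    out = []
    for f in freqs:
        re = sum((x[t] - mean) * math.cos(2 * math.pi * f * t / fs) for t in range(n))
        im = -sum((x[t] - mean) * math.sin(2 * math.pi * f * t / fs) for t in range(n))
        out.append((re * re + im * im) / n)
    return out

def dominant_frequency(x, fs, fmin=0.1, fmax=4.0, step=0.1):
    """Frequency (Hz) with the most power on a coarse low-frequency grid."""
    freqs = [round(fmin + k * step, 6) for k in range(int((fmax - fmin) / step) + 1)]
    p = dft_power(x, fs, freqs)
    return freqs[p.index(max(p))]
```

Applied to a smooth train of bumps spaced one second apart (a caricature of IU onsets), the dominant frequency comes out near 1 Hz, the rhythm the abstract reports.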


2001 ◽  
Vol 13 (7) ◽  
pp. 994-1005 ◽  
Author(s):  
Athena Vouloumanos ◽  
Kent A. Kiehl ◽  
Janet F. Werker ◽  
Peter F. Liddle

The detection of speech in an auditory stream is a requisite first step in processing spoken language. In this study, we used event-related fMRI to investigate the neural substrates mediating detection of speech compared with that of nonspeech auditory stimuli. Unlike previous studies addressing this issue, we contrasted speech with nonspeech analogues that were matched along key temporal and spectral dimensions. In an oddball detection task, listeners heard nonsense speech sounds, matched sine wave analogues (complex nonspeech), or single tones (simple nonspeech). Speech stimuli elicited significantly greater activation than both complex and simple nonspeech stimuli in classic receptive language areas, namely the middle temporal gyri bilaterally and in a locus lateralized to the left posterior superior temporal gyrus. In addition, speech activated a small cluster of the right inferior frontal gyrus. The activation of these areas in a simple detection task, which requires neither identification nor linguistic analysis, suggests they play a fundamental role in speech processing.


2020 ◽  
Author(s):  
Brian A. Metzger ◽  
John F. Magnotti ◽  
Zhengjia Wang ◽  
Elizabeth Nesbitt ◽  
Patrick J. Karas ◽  
...  

Abstract Experimentalists studying multisensory integration compare neural responses to multisensory stimuli with responses to the component modalities presented in isolation. This procedure is problematic for multisensory speech perception since audiovisual speech and auditory-only speech are easily intelligible but visual-only speech is not. To overcome this confound, we developed intracranial electroencephalography (iEEG) deconvolution. Individual stimuli always contained both auditory and visual speech, but jittering the onset asynchrony between modalities allowed the time course of the unisensory responses and the interaction between them to be independently estimated. We applied this procedure to electrodes implanted in human epilepsy patients (both male and female) over the posterior superior temporal gyrus (pSTG), a brain area known to be important for speech perception. iEEG deconvolution revealed sustained, positive responses to visual-only speech and larger, phasic responses to auditory-only speech. Confirming results from scalp EEG, responses to audiovisual speech were weaker than responses to auditory-only speech, demonstrating a subadditive multisensory neural computation. Leveraging the spatial resolution of iEEG, we extended these results to show that subadditivity is most pronounced in more posterior aspects of the pSTG. Across electrodes, subadditivity correlated with visual responsiveness, supporting a model in which visual speech enhances the efficiency of auditory speech processing in pSTG. The ability to separate neural processes may make iEEG deconvolution useful for studying a variety of complex cognitive and perceptual tasks.
Significance statement: Understanding speech is one of the most important human abilities. Speech perception uses information from both the auditory and visual modalities. It has been difficult to study neural responses to visual speech because visual-only speech is difficult or impossible to comprehend, unlike auditory-only and audiovisual speech. We used intracranial electroencephalography (iEEG) deconvolution to overcome this obstacle. We found that visual speech evokes a positive response in the human posterior superior temporal gyrus, enhancing the efficiency of auditory speech processing.
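The core of the deconvolution idea, jittering the audiovisual onset asynchrony so that the overlapping auditory and visual response time courses become separable by ordinary least squares, can be sketched with lagged indicator predictors. This is a toy illustration, not the authors' code; the trial timings, lag count, and response shapes used to exercise it are invented for the demonstration.

```python
def lagged_design(events, n_samples, n_lags):
    """Toeplitz-style predictor matrix: column k is the event train
    delayed by k samples."""
    X = [[0.0] * n_lags for _ in range(n_samples)]
    for t in events:
        for k in range(n_lags):
            if t + k < n_samples:
                X[t + k][k] = 1.0
    return X

def solve_normal_equations(X, y):
    """Ordinary least squares via Gaussian elimination on X'X theta = X'y."""
    n, p = len(X), len(X[0])
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    b = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    for col in range(p):  # forward elimination with partial pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    theta = [0.0] * p
    for r in range(p - 1, -1, -1):  # back substitution
        theta[r] = (b[r] - sum(A[r][c] * theta[c] for c in range(r + 1, p))) / A[r][r]
    return theta

def deconvolve(aud_onsets, vis_onsets, y, n_lags):
    """Jointly estimate the auditory and visual response time courses from a
    signal y in which both overlap; jittered asynchrony makes them separable."""
    Xa = lagged_design(aud_onsets, len(y), n_lags)
    Xv = lagged_design(vis_onsets, len(y), n_lags)
    X = [ra + rv for ra, rv in zip(Xa, Xv)]
    theta = solve_normal_equations(X, y)
    return theta[:n_lags], theta[n_lags:]
```

With varying auditory-visual lags across trials, a phasic "auditory" kernel and a sustained "visual" kernel are both recovered from their sum, which is exactly what a fixed asynchrony would not permit.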


2019 ◽  
Author(s):  
Maya Inbar ◽  
Eitan Grossman ◽  
Ayelet N. Landau

Abstract Studies of speech processing investigate the relationship between temporal structure in speech stimuli and neural activity. Despite clear evidence that the brain tracks speech at low frequencies (~1 Hz), it is not well understood what linguistic information gives rise to this rhythm. Here, we harness linguistic theory to draw attention to Intonation Units (IUs), a fundamental prosodic unit of human language, and characterize their temporal structure as captured in the speech envelope, an acoustic representation relevant to the neural processing of speech. IUs are defined by a specific pattern of syllable delivery, together with resets in pitch and articulatory force. Linguistic studies of spontaneous speech indicate that this prosodic segmentation paces new information in language use across diverse languages. Therefore, IUs provide a universal structural cue for the cognitive dynamics of speech production and comprehension. We study the relation between IUs and periodicities in the speech envelope, applying methods from investigations of neural synchronization. Our sample includes recordings from everyday speech contexts of over 100 speakers and six languages. We find that sequences of IUs form a consistent low-frequency rhythm and constitute a significant periodic cue within the speech envelope. Our findings allow us to predict that IUs are utilized by the neural system when tracking speech, and the methods we introduce facilitate testing this prediction given physiological data.


2021 ◽  
Vol 10 (14) ◽  
pp. 3078
Author(s):  
Sara Akbarzadeh ◽  
Sungmin Lee ◽  
Chin-Tuan Tan

In multi-speaker environments, cochlear implant (CI) users may attend to a target sound source in a different manner from normal hearing (NH) individuals during a conversation. This study investigated the effect of conversational sound levels on the selective auditory attention mechanisms adopted by CI and NH listeners, and how these affect their daily conversation. Nine CI users (five bilateral, three unilateral, and one bimodal) and eight NH listeners participated in this study. Behavioral speech recognition scores were collected using a matrix sentence test, and neural tracking of the speech envelope was recorded using electroencephalography (EEG). Speech stimuli were presented at three different levels (75, 65, and 55 dB SPL) in the presence of two maskers from three spatially separated speakers. Different combinations of assisted/impaired hearing modes were evaluated for CI users, and the outcomes were analyzed in three categories: electric hearing only, acoustic hearing only, and electric + acoustic hearing. Our results showed that increasing the conversational sound level degraded selective auditory attention in electric hearing. On the other hand, increasing the sound level improved selective auditory attention for the acoustic hearing group. In the NH listeners, however, increasing the sound level did not cause a significant change in auditory attention. Our results imply that the effect of sound level on selective auditory attention varies depending on the hearing mode, and that loudness control is necessary for CI users to attend to conversations with ease.
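Neural tracking of a speech envelope is commonly quantified by correlation, and selective attention can then be decoded by asking which talker's envelope correlates more strongly with the neural signal. The sketch below is a minimal illustration under that common assumption, not the study's actual analysis; the function and variable names are illustrative.

```python
def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def decode_attention(neural, env_a, env_b):
    """Label the attended talker as the speech envelope that correlates
    more strongly with the (reconstructed) neural signal."""
    ra, rb = pearson(neural, env_a), pearson(neural, env_b)
    return ('A', ra, rb) if ra > rb else ('B', ra, rb)
```

A level-dependent drop in the winning correlation margin would be one simple way to operationalize the degradation of selective attention reported for electric hearing.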


2021 ◽  
pp. 2150022
Author(s):  
Caio Cesar Enside de Abreu ◽  
Marco Aparecido Queiroz Duarte ◽  
Bruno Rodrigues de Oliveira ◽  
Jozue Vieira Filho ◽  
Francisco Villarreal

Speech processing systems are very important in different applications involving speech and voice quality, such as automatic speech recognition, forensic phonetics and speech enhancement, among others. In most of them, acoustic environmental noise is added to the original signal, decreasing the signal-to-noise ratio (SNR) and, consequently, the speech quality. Therefore, noise estimation is one of the most important steps in speech processing, whether to reduce noise before processing or to design robust algorithms. In this paper, a new approach to estimating noise from speech signals is presented and its effectiveness is tested in the speech enhancement context. For this purpose, partial least squares (PLS) regression is used to model the acoustic environment (AE), and a Wiener filter based on a priori SNR estimation is implemented to evaluate the proposed approach. Six noise types are used to create seven acoustically modeled noises. The basic idea is to use the AE model to identify the noise type and estimate its power for use in a speech processing system. Speech signals processed using the proposed method and classical noise estimators are evaluated through objective measures. Results show that the proposed method produces better speech quality than state-of-the-art noise estimators, enabling it to be used in real-time applications in the fields of robotics, telecommunications and acoustic analysis.
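A Wiener filter driven by a priori SNR estimation can be sketched per frame and frequency bin. The decision-directed rule used below (with smoothing factor alpha) is an assumed, standard choice for the a priori SNR estimator, not necessarily the paper's exact formulation; in the paper's setting the noise PSD would come from the PLS-based acoustic environment model rather than being given.

```python
def wiener_gains(noisy_power, noise_power, prev_clean_power, alpha=0.98):
    """One frame of a-priori-SNR Wiener filtering.
    noisy_power:      |Y_k|^2 per frequency bin for this frame
    noise_power:      estimated noise PSD per bin (from a noise estimator)
    prev_clean_power: |S_k|^2 estimated on the previous frame
    Returns the per-bin gains and the clean-power estimate for the next frame."""
    gains, clean_power = [], []
    for y2, n2, s2 in zip(noisy_power, noise_power, prev_clean_power):
        gamma = y2 / max(n2, 1e-12)                       # a posteriori SNR
        xi = alpha * s2 / max(n2, 1e-12) \
            + (1 - alpha) * max(gamma - 1.0, 0.0)         # decision-directed a priori SNR
        g = xi / (1.0 + xi)                               # Wiener gain
        gains.append(g)
        clean_power.append((g * g) * y2)                  # feeds the next frame's xi
    return gains, clean_power
```

Bins dominated by speech keep gains near 1 while noise-only bins are attenuated toward 0, which is how a better noise-power estimate (here, from the AE model) translates directly into better enhanced speech.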


2018 ◽  
Author(s):  
Arafat Angulo-Perkins ◽  
Luis Concha

ABSTRACT Musicality refers to specific biological traits that allow us to perceive, generate and enjoy music. These abilities can be studied at different organizational levels (e.g., behavioural, physiological, evolutionary), and all of them reflect that music and speech processing are two different cognitive domains. Previous research has shown evidence of this functional divergence in auditory cortical regions in the superior temporal gyrus (such as the planum polare), showing increased activity upon listening to music, as compared to other complex acoustic signals. Here, we examine brain activity underlying vocal music and speech perception, while we compare musicians and non-musicians. We designed a stimulation paradigm using the same voice to produce spoken sentences, hummed melodies, and sung sentences; the same sentences were used in the speech and song categories, and the same melodies were used in the musical categories (song and hum). Participants listened to this paradigm while we acquired functional magnetic resonance images (fMRI). Different analyses demonstrated greater involvement of specific auditory and motor regions during music perception, as compared to speech vocalizations. This music-sensitive network includes bilateral activation of the planum polare and temporale, as well as a group of regions lateralized to the right hemisphere that includes the supplementary motor area, premotor cortex and the inferior frontal gyrus. Our results show that the simple act of listening to music generates stronger activation of motor regions, possibly preparing us to move following the beat. Vocal music listening, with and without lyrics, is also accompanied by a higher modulation of specific secondary auditory cortices such as the planum polare, confirming its crucial role in music processing independently of previous musical training. This study provides further evidence that music perception enhances audio-sensorimotor activity, which is crucial for clinical approaches exploring music-based therapies to improve communicative and motor skills.


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Raphaël Thézé ◽  
Mehdi Ali Gadiri ◽  
Louis Albert ◽  
Antoine Provost ◽  
Anne-Lise Giraud ◽  
...  

Abstract Natural speech is processed in the brain as a mixture of auditory and visual features. An example of the importance of visual speech is the McGurk effect and related perceptual illusions that result from mismatching auditory and visual syllables. Although the McGurk effect has widely been applied to the exploration of audio-visual speech processing, it relies on isolated syllables, which severely limits the conclusions that can be drawn from the paradigm. In addition, the extreme variability and the quality of the stimuli usually employed prevent comparability across studies. To overcome these limitations, we present an innovative methodology using 3D virtual characters with realistic lip movements synchronized on computer-synthesized speech. We used commercially accessible and affordable tools to facilitate reproducibility and comparability, and the set-up was validated on 24 participants performing a perception task. Within complete and meaningful French sentences, we paired a labiodental fricative viseme (i.e. /v/) with a bilabial occlusive phoneme (i.e. /b/). This audiovisual mismatch is known to induce the illusion of hearing /v/ in a proportion of trials. We tested the rate of the illusion while varying the magnitude of background noise and audiovisual lag. Overall, the effect was observed in 40% of trials. The proportion rose to about 50% with added background noise and up to 66% when controlling for phonetic features. Our results conclusively demonstrate that computer-generated speech stimuli are a judicious choice, and that they can supplement natural speech with higher control over stimulus timing and content.


2018 ◽  
Vol 8 (4) ◽  
pp. 115
Author(s):  
Abdul Malik Abbasi ◽  
Mansoor Ahmed Channa ◽  
Stephen John ◽  
Masood Akhter Memon ◽  
Rabia Anwar

Acoustic analysis tests the hypothesis that the physical properties of Pakistani English (PaKE) vowels differ from those of native American English speakers in their acoustic measurements. The present paper aims to document the physical behavior of English vowels produced by PaKE learners, measuring the frequencies and durations of the vowels involved in their articulation. The English vowels selected for this purpose are /æ/, /ɛ/, /ɪ/, /ɒ/ and /ə/. Samples were obtained from ten speakers at the department of computer science at Sindh Madressatul Islam University, Karachi. The study was based on the analysis of 500 (10 × 5 × 10 = 500) voice samples. Five vowel minimal pairs were selected and written in a carrier phrase [I say CVC now]. The ten speakers (five male and five female) recorded their 500 voice samples using the Praat speech processing tool and a high-quality microphone on a laptop in a computer laboratory with no background noise. Three parameters were considered for the analysis of the PaKE vowels: the duration of the five vowels and the first two formant frequencies (F1 and F2). It was hypothesized that the properties of PaKE vowels differ from those of native English speakers. The hypothesis was supported, since the acoustic measurements of the PaKE speakers' sounds were found to differ from those of native American English speakers.

