Speech Fine Structure Contains Critical Temporal Cues to Support Speech Segmentation

2018 ◽  
Author(s):  
Xiangbin Teng ◽  
Gregory Cogan ◽  
David Poeppel

Segmenting the continuous speech stream into units for further perceptual and linguistic analyses is fundamental to speech recognition. The speech amplitude envelope (SE) has long been considered a fundamental temporal cue for segmenting speech. Does the temporal fine structure (TFS), a significant part of speech signals often considered to contain primarily spectral information, contribute to speech segmentation? Using magnetoencephalography, we show that the TFS entrains cortical oscillatory responses in the 3–6 Hz range and demonstrate, using mutual information analysis, (i) that the temporal information in the TFS can be reconstructed from a measure of frame-to-frame spectral change and correlates with the SE, and (ii) that spectral resolution is key to the extraction of such temporal information. Furthermore, we show behavioural evidence that, when the SE is temporally distorted, the TFS provides cues for speech segmentation and aids speech recognition significantly. Our findings show that it is insufficient to investigate solely the SE to understand temporal speech segmentation, as the SE and the TFS derived from a band-filtering method convey comparable, if not inseparable, temporal information. We argue for a more synthetic view of speech segmentation: the auditory system groups speech signals coherently in both temporal and spectral domains.
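The SE/TFS decomposition referred to here is conventionally obtained from the Hilbert transform of a band-filtered signal: the envelope is the magnitude of the analytic signal and the fine structure is its unit-amplitude carrier. A minimal sketch (band edges and filter order are illustrative, not the authors' exact analysis parameters):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def envelope_and_tfs(x, fs, band=(300.0, 700.0)):
    """Band-filter a signal and split it into amplitude envelope (SE)
    and temporal fine structure (TFS) via the Hilbert transform."""
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    x_band = sosfiltfilt(sos, x)          # zero-phase bandpass
    analytic = hilbert(x_band)
    se = np.abs(analytic)                 # slow amplitude modulation
    tfs = np.cos(np.angle(analytic))      # unit-amplitude carrier
    return se, tfs

# A toy AM tone: 500 Hz carrier modulated at 4 Hz (syllable-rate range)
fs = 16000
t = np.arange(fs) / fs
x = (1 + 0.8 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 500 * t)
se, tfs = envelope_and_tfs(x, fs)
```

For this toy input, `se` tracks the 4 Hz modulator while `tfs` retains only the carrier oscillation, which is the sense in which the two representations separate temporal and spectral information.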

NeuroImage ◽  
2019 ◽  
Vol 202 ◽  
pp. 116152 ◽  
Author(s):  
Xiangbin Teng ◽  
Gregory B. Cogan ◽  
David Poeppel

1998 ◽  
Vol 41 (2) ◽  
pp. 315-326 ◽  
Author(s):  
Pamela E. Souza ◽  
Christopher W. Turner

Although multichannel compression systems are quickly becoming integral components of programmable hearing aids, research results have not consistently demonstrated their benefit over conventional amplification. The present study examined two confounding factors that may have contributed to this inconsistency in results: alteration of temporal information and audibility of speech cues. Recognition of linearly amplified and multichannel-compressed speech was measured for listeners with mild-to-severe sensorineural hearing loss and for a control group of listeners with normal hearing. In addition to the standard speech signal, which provided both temporal and spectral information, the listener's ability to use temporal information in a multichannel-compressed signal was directly tested using a signal-correlated noise (SCN) stimulus. This stimulus consisted of a time-varying speech envelope modulating a two-channel noise carrier. It preserved temporal cues but provided minimal spectral information. For each stimulus condition, short-term level measurements were used to determine the range of audible speech. Multichannel compression improved speech recognition under conditions where the two-channel compression system provided superior audibility over linear amplification. When audibility of both linearly amplified and multichannel-compressed speech was maximized, multichannel compression had no significant effect on speech recognition scores for speech containing both temporal and spectral cues. However, results for the SCN stimuli show that more extreme amounts of multichannel compression can reduce use of temporal information.
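A stimulus of the kind described above can be sketched by extracting the speech envelope in each of two bands and using it to modulate a noise carrier filtered to the same band, so that temporal cues survive while most spectral detail is discarded. The band edges and filter order below are illustrative assumptions, not the parameters used in the study:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def two_channel_scn(speech, fs, crossover=1500.0, seed=None):
    """Two-channel signal-correlated noise: the speech envelope in each
    band modulates band-matched noise, preserving temporal cues while
    providing minimal spectral information."""
    rng = np.random.default_rng(seed)
    bands = [(80.0, crossover), (crossover, 7000.0)]
    out = np.zeros_like(speech)
    for lo, hi in bands:
        sos = butter(4, (lo, hi), btype="bandpass", fs=fs, output="sos")
        env = np.abs(hilbert(sosfiltfilt(sos, speech)))        # band envelope
        carrier = sosfiltfilt(sos, rng.standard_normal(len(speech)))
        out += env * carrier                                   # modulate noise
    return out

fs = 16000
t = np.arange(fs) / fs
speech = (1 + np.sin(2 * np.pi * 3 * t)) * np.sin(2 * np.pi * 440 * t)
scn = two_channel_scn(speech, fs, seed=0)
```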


2020 ◽  
Vol 24 ◽  
pp. 233121652098029 ◽  
Author(s):  
Allison Trine ◽  
Brian B. Monson

Several studies have demonstrated that extended high frequencies (EHFs; >8 kHz) in speech are not only audible but also have some utility for speech recognition, including for speech-in-speech recognition when maskers are facing away from the listener. However, the contribution of EHF spectral versus temporal information to speech recognition is unknown. Here, we show that access to EHF temporal information improved speech-in-speech recognition relative to speech bandlimited at 8 kHz but that additional access to EHF spectral detail provided an additional small but significant benefit. Results suggest that both EHF spectral structure and the temporal envelope contribute to the observed EHF benefit. Speech recognition performance was quite sensitive to masker head orientation, with a rotation of only 15° providing a highly significant benefit. An exploratory analysis indicated that pure-tone thresholds at EHFs are better predictors of speech recognition performance than low-frequency pure-tone thresholds.


2019 ◽  
Vol 7 (3) ◽  
pp. 219-242 ◽  
Author(s):  
Kyle J. Comishen ◽  
Scott A. Adler

The capacity to process temporal information and incorporate it into behavioural decisions is integral to functioning in our environment. Whereas previous research has extended adults' temporal processing capacities down the developmental timeline to infants, little research has examined infants' capacity to use that temporal information to guide their future behaviour, or whether this capacity can detect event-timing differences on the order of milliseconds. The present study examined 3- and 6-month-old infants' ability to process temporal durations of 700 and 1200 milliseconds by means of the Visual Expectation Cueing Paradigm, in which the duration of a central stimulus predicted whether a target would appear on the left or the right of a screen. If 3- and 6-month-old infants could discriminate the millisecond-scale difference between the centrally presented temporal cues, they would make anticipatory eye movements to the correct target location at a rate above chance. Results indicated that 6- but not 3-month-olds successfully discriminated and incorporated events' temporal information into their visual expectations. Brain maturation and the perceptual capacity to discriminate the relative timing of temporal events may account for these findings. This developmental limitation in processing and discriminating events on the scale of milliseconds may consequently be a previously unexplored limiting factor for attentional and cognitive development.


2017 ◽  
Vol 68 (2) ◽  
pp. 346-354 ◽  
Author(s):  
Ján Staš ◽  
Daniel Hládek ◽  
Peter Viszlay ◽  
Tomáš Koctúr

Abstract This paper describes a new corpus dedicated to Slovak speech recognition, built from TEDx talks and Jump Slovakia lectures. The proposed speech database consists of 220 talks and lectures with a total duration of about 58 hours. The annotated speech database was generated automatically in an unsupervised manner, using acoustic speech segmentation based on principal component analysis and automatic transcription by two complementary speech recognition systems. An evaluation set, consisting of 50 manually annotated talks and lectures with a total duration of about 12 hours, was created to assess the quality of Slovak speech recognition. Through unsupervised automatic annotation of the TEDx talks and Jump Slovakia lectures, we obtained 21.26% of the data as new speech segments with a word error rate of approximately 9.44%, suitable for retraining or adapting previously trained acoustic models.
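The word error rate quoted above is the standard ASR metric: the Levenshtein distance over words (substitutions + deletions + insertions) divided by the reference length. A minimal sketch of the computation (not the authors' evaluation code):

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER = (substitutions + deletions + insertions) / len(reference),
    computed with word-level Levenshtein distance via dynamic programming."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    return d[len(r)][len(h)] / len(r)
```

For example, against the reference "a b c d", the hypothesis "a x c" costs one substitution and one deletion, giving a WER of 0.5.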


2021 ◽  
Author(s):  
Laurianne Cabrera ◽  
Bonnie K. Lau

The processing of auditory temporal information is important for the extraction of voice pitch and linguistic information, as well as the overall temporal structure of speech. However, many aspects of its early development remain poorly understood. This paper reviews the development of different aspects of auditory temporal processing during the first year of life, when infants are acquiring their native language. First, potential mechanisms of neural immaturity are discussed in the context of neurophysiological studies. Next, what is known about infant auditory capabilities is considered, with a focus on psychophysical studies that use non-speech stimuli to investigate the perception of temporal fine structure and envelope cues. This is followed by a review of studies involving speech stimuli, including those that present vocoded signals as a method of degrading the spectro-temporal information available to infant listeners. Finally, we highlight key findings from the cochlear implant literature that illustrate the importance of temporal cues in speech perception.


Author(s):  
A. Vatri ◽  
B. McGillivray

The Diorisis Ancient Greek Corpus is a digital collection of Ancient Greek texts (from Homer to the early fifth century AD) compiled for linguistic analyses, and specifically for developing a computational model of semantic change in Ancient Greek. The corpus consists of 820 texts sourced from open-access digital libraries. The texts have been automatically enriched with morphological information for each word, and the automatic assignment of words to the correct dictionary entry (lemmatization) has been disambiguated with a part-of-speech tagger (a computer programme that selects the part of speech to which an ambiguous word belongs).

