Speech Fine Structure Contains Critical Temporal Cues to Support Speech Segmentation

2018 ◽  
Author(s):  
Xiangbin Teng ◽  
Gregory Cogan ◽  
David Poeppel

Segmenting the continuous speech stream into units for further perceptual and linguistic analyses is fundamental to speech recognition. The speech amplitude envelope (SE) has long been considered a fundamental temporal cue for segmenting speech. Does the temporal fine structure (TFS), a significant part of speech signals often considered to contain primarily spectral information, contribute to speech segmentation? Using magnetoencephalography, we show that the TFS entrains cortical oscillatory responses in the 3–6 Hz range and demonstrate, using mutual information analysis, (i) that the temporal information in the TFS can be reconstructed from a measure of frame-to-frame spectral change and correlates with the SE, and (ii) that spectral resolution is key to the extraction of such temporal information. Furthermore, we show behavioural evidence that, when the SE is temporally distorted, the TFS provides cues for speech segmentation and aids speech recognition significantly. Our findings show that it is insufficient to investigate solely the SE to understand temporal speech segmentation, as the SE and the TFS derived from a band-filtering method convey comparable, if not inseparable, temporal information. We argue for a more synthetic view of speech segmentation: the auditory system groups speech signals coherently in both temporal and spectral domains.
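The SE/TFS decomposition referred to here is conventionally obtained from the Hilbert transform of a band-filtered signal: the envelope is the magnitude of the analytic signal and the fine structure is its unit-amplitude carrier. A minimal sketch (band edges and filter order are illustrative, not the authors' exact analysis parameters):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def envelope_and_tfs(x, fs, band=(300.0, 700.0)):
    """Band-filter a signal and split it into amplitude envelope (SE)
    and temporal fine structure (TFS) via the Hilbert transform."""
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    x_band = sosfiltfilt(sos, x)          # zero-phase bandpass
    analytic = hilbert(x_band)
    se = np.abs(analytic)                 # slow amplitude modulation
    tfs = np.cos(np.angle(analytic))      # unit-amplitude carrier
    return se, tfs

# A toy AM tone: 500 Hz carrier modulated at 4 Hz (syllable-rate range)
fs = 16000
t = np.arange(fs) / fs
x = (1 + 0.8 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 500 * t)
se, tfs = envelope_and_tfs(x, fs)
```

For this toy input, `se` tracks the 4 Hz modulator while `tfs` retains only the carrier oscillation, which is the sense in which the two representations separate temporal and spectral information.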

NeuroImage ◽  
2019 ◽  
Vol 202 ◽  
pp. 116152 ◽  
Author(s):  
Xiangbin Teng ◽  
Gregory B. Cogan ◽  
David Poeppel

1998 ◽  
Vol 41 (2) ◽  
pp. 315-326 ◽  
Author(s):  
Pamela E. Souza ◽  
Christopher W. Turner

Although multichannel compression systems are quickly becoming integral components of programmable hearing aids, research results have not consistently demonstrated their benefit over conventional amplification. The present study examined two confounding factors that may have contributed to this inconsistency in results: alteration of temporal information and audibility of speech cues. Recognition of linearly amplified and multichannel-compressed speech was measured for listeners with mild-to-severe sensorineural hearing loss and for a control group of listeners with normal hearing. In addition to the standard speech signal, which provided both temporal and spectral information, the listener's ability to use temporal information in a multichannel-compressed signal was directly tested using a signal-correlated noise (SCN) stimulus. This stimulus consisted of a time-varying speech envelope modulating a two-channel noise carrier. It preserved temporal cues but provided minimal spectral information. For each stimulus condition, short-term level measurements were used to determine the range of audible speech. Multichannel compression improved speech recognition under conditions where the two-channel compression system provided superior audibility over linear amplification. When audibility of both linearly amplified and multichannel-compressed speech was maximized, multichannel compression had no significant effect on speech recognition scores for speech containing both temporal and spectral cues. However, results for the SCN stimuli show that more extreme amounts of multichannel compression can reduce use of temporal information.
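A stimulus of the kind described above can be sketched by extracting the speech envelope in each of two bands and using it to modulate a noise carrier filtered to the same band, so that temporal cues survive while most spectral detail is discarded. The band edges and filter order below are illustrative assumptions, not the parameters used in the study:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def two_channel_scn(speech, fs, crossover=1500.0, seed=None):
    """Two-channel signal-correlated noise: the speech envelope in each
    band modulates band-matched noise, preserving temporal cues while
    providing minimal spectral information."""
    rng = np.random.default_rng(seed)
    bands = [(80.0, crossover), (crossover, 7000.0)]
    out = np.zeros_like(speech)
    for lo, hi in bands:
        sos = butter(4, (lo, hi), btype="bandpass", fs=fs, output="sos")
        env = np.abs(hilbert(sosfiltfilt(sos, speech)))        # band envelope
        carrier = sosfiltfilt(sos, rng.standard_normal(len(speech)))
        out += env * carrier                                   # modulate noise
    return out

fs = 16000
t = np.arange(fs) / fs
speech = (1 + np.sin(2 * np.pi * 3 * t)) * np.sin(2 * np.pi * 440 * t)
scn = two_channel_scn(speech, fs, seed=0)
```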


2020 ◽  
Vol 24 ◽  
pp. 233121652098029 ◽  
Author(s):  
Allison Trine ◽  
Brian B. Monson

Several studies have demonstrated that extended high frequencies (EHFs; >8 kHz) in speech are not only audible but also have some utility for speech recognition, including for speech-in-speech recognition when maskers are facing away from the listener. However, the contribution of EHF spectral versus temporal information to speech recognition is unknown. Here, we show that access to EHF temporal information improved speech-in-speech recognition relative to speech bandlimited at 8 kHz but that additional access to EHF spectral detail provided an additional small but significant benefit. Results suggest that both EHF spectral structure and the temporal envelope contribute to the observed EHF benefit. Speech recognition performance was quite sensitive to masker head orientation, with a rotation of only 15° providing a highly significant benefit. An exploratory analysis indicated that pure-tone thresholds at EHFs are better predictors of speech recognition performance than low-frequency pure-tone thresholds.


2019 ◽  
Vol 7 (3) ◽  
pp. 219-242 ◽  
Author(s):  
Kyle J. Comishen ◽  
Scott A. Adler

The capacity to process temporal information and incorporate it into behavioural decisions is integral to functioning in our environment. Whereas previous research has extended adults' temporal processing capacities down the developmental timeline to infants, little research has examined infants' capacity to use that temporal information to guide their future behaviour, or whether this capacity can detect event-timing differences on the order of milliseconds. The present study examined 3- and 6-month-old infants' ability to process temporal durations of 700 and 1200 milliseconds by means of the Visual Expectation Cueing Paradigm, in which the duration of a central stimulus predicted whether a target would appear on the left or the right of a screen. If 3- and 6-month-old infants could discriminate the millisecond-scale difference between the centrally presented temporal cues, they would make anticipatory eye movements to the correct target location at a rate above chance. Results indicated that 6- but not 3-month-olds successfully discriminated and incorporated events' temporal information into their visual expectations. Brain maturation and the perceptual capacity to discriminate the relative timing of temporal events may account for these findings. This developmental limitation in processing and discriminating events on the scale of milliseconds may consequently be a previously unexplored limiting factor for attentional and cognitive development.


2017 ◽  
Vol 68 (2) ◽  
pp. 346-354 ◽  
Author(s):  
Ján Staš ◽  
Daniel Hládek ◽  
Peter Viszlay ◽  
Tomáš Koctúr

Abstract This paper describes a new corpus dedicated to Slovak speech recognition, built from TEDx talks and Jump Slovakia lectures. The proposed speech database consists of 220 talks and lectures with a total duration of about 58 hours. The annotated speech database was generated automatically in an unsupervised manner, using acoustic speech segmentation based on principal component analysis and automatic transcription by two complementary speech recognition systems. An evaluation set, consisting of 50 manually annotated talks and lectures with a total duration of about 12 hours, was created to assess the quality of Slovak speech recognition. Through unsupervised automatic annotation of the TEDx talks and Jump Slovakia lectures, we obtained 21.26% of the data as new speech segments with a word error rate of approximately 9.44%, suitable for retraining or adapting previously trained acoustic models.
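The word error rate quoted above is the standard ASR metric: the Levenshtein distance over words (substitutions + deletions + insertions) divided by the reference length. A minimal sketch of the computation (not the authors' evaluation code):

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER = (substitutions + deletions + insertions) / len(reference),
    computed with word-level Levenshtein distance via dynamic programming."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    return d[len(r)][len(h)] / len(r)
```

For example, against the reference "a b c d", the hypothesis "a x c" costs one substitution and one deletion, giving a WER of 0.5.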


2021 ◽  
Author(s):  
Laurianne Cabrera ◽  
Bonnie K. Lau

The processing of auditory temporal information is important for the extraction of voice pitch and linguistic information, as well as the overall temporal structure of speech. However, many aspects of its early development remain poorly understood. This paper reviews the development of different aspects of auditory temporal processing during the first year of life, when infants are acquiring their native language. First, potential mechanisms of neural immaturity are discussed in the context of neurophysiological studies. Next, what is known about infant auditory capabilities is considered, with a focus on psychophysical studies that use non-speech stimuli to investigate the perception of temporal fine structure and envelope cues. This is followed by a review of studies involving speech stimuli, including those that present vocoded signals as a method of degrading the spectro-temporal information available to infant listeners. Finally, we highlight key findings from the cochlear implant literature that illustrate the importance of temporal cues in speech perception.


Author(s):  
A. Vatri ◽  
B. McGillivray

The Diorisis Ancient Greek Corpus is a digital collection of Ancient Greek texts (from Homer to the early fifth century AD) compiled for linguistic analyses, and specifically for developing a computational model of semantic change in Ancient Greek. The corpus consists of 820 texts sourced from open-access digital libraries. The texts have been automatically enriched with morphological information for each word, and the automatic assignment of words to the correct dictionary entry (lemmatization) has been disambiguated with a part-of-speech tagger (a computer programme that selects the part of speech to which an ambiguous word belongs).

