Speech Processing in Support of Human-Human Communication (Invited Paper)

Author(s):  
Alex Waibel
2013 ◽  
Vol 309 ◽  
pp. 260-267
Author(s):  
Laszlo Czap ◽  
Judit Pinter

The most comfortable way of human communication is speech, which is also a possible channel for human-machine interfaces. Moreover, a voice-driven system can be controlled with busy hands. The performance of a speech recognition system is severely degraded by the presence of noise. Logistic systems typically operate in noisy environments, so noise reduction is crucial in industrial speech processing systems. Traditional noise reduction procedures (e.g., Wiener and Kalman filters) are effective on stationary or Gaussian noise. The noise of a real workplace can be captured by an additional microphone: the voice microphone picks up both speech and noise, while the noise microphone picks up only the noise signal. Because of the phase shift between the two signals, simple subtraction in the time domain is ineffective. In this paper, we discuss a spectral representation modeling the noise and voice signals. A noise cancellation method based on the frequency spectrum is proposed and verified in a real industrial environment.
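
As an illustration of the frequency-domain approach, the following is a minimal sketch of two-channel spectral subtraction, not the authors' exact method: the magnitude spectrum from the noise microphone is subtracted from the voice-channel magnitude, and the voice channel's phase is reused, which sidesteps the time-domain phase-shift problem. The channels are assumed to be time-aligned, and the spectral floor value is an illustrative choice.

import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(voice, noise_ref, fs, nperseg=512, floor=0.05):
    # STFT of the voice mike (speech + noise) and the noise mike (noise only)
    _, _, V = stft(voice, fs=fs, nperseg=nperseg)
    _, _, N = stft(noise_ref, fs=fs, nperseg=nperseg)
    # Subtract magnitudes; clamp to a spectral floor to avoid negative values
    mag = np.maximum(np.abs(V) - np.abs(N), floor * np.abs(V))
    # Reuse the voice channel's phase and transform back to the time domain
    _, cleaned = istft(mag * np.exp(1j * np.angle(V)), fs=fs, nperseg=nperseg)
    return cleaned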


2016 ◽  
Vol 2016 ◽  
pp. 1-8
Author(s):  
Vasilisa Verkhodanova ◽  
Vladimir Shapranov

The development and popularity of voice-user interfaces have made spontaneous speech processing an important research field. One of the main focus areas in this field is automatic speech recognition (ASR), which enables computers to recognize and transcribe spoken language into text. However, ASR systems often work less efficiently for spontaneous than for read speech, since the former differs from other types of speech in many ways, and the presence of speech disfluencies is its most prominent characteristic. These phenomena are an important feature of human-human communication, yet at the same time they pose a challenging obstacle for speech processing tasks. In this paper we address the detection of voiced hesitations (filled pauses and sound lengthenings) in Russian spontaneous speech using different machine learning techniques, from grid search and gradient descent in rule-based approaches to data-driven ones such as extreme learning machines (ELM) and support vector machines (SVM) trained on automatically extracted acoustic features. Experimental results on a mixed, quality-diverse corpus of spontaneous Russian speech indicate the efficiency of these techniques for the task in question, with SVM outperforming the other methods.
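
As a hedged sketch of the SVM variant, the following classifies fixed-length frames as voiced hesitation versus other speech from precomputed acoustic features. The feature and label files are hypothetical placeholders, and the RBF kernel and train/test split are illustrative choices, not the authors' exact configuration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical per-frame acoustic features (e.g., MFCCs, energy, F0)
X = np.load("frame_features.npy")   # shape: (n_frames, n_features)
y = np.load("frame_labels.npy")     # 1 = voiced hesitation, 0 = other

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_tr, y_tr)
print("held-out frame accuracy:", clf.score(X_te, y_te))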


2019 ◽  
Author(s):  
Jonathan E. Peelle

Understanding the neural systems supporting speech perception can shed light on the representations, processes, and variability in human communication. In the case of speech and language disorders, uncovering the neurological underpinnings can sometimes lead to surgical or medical treatments. Even in the case of healthy listeners, better understanding the interactions among hierarchical brain systems during speech processing can deepen our understanding of perceptual and language processes, and how these might be affected during development, by hearing loss, or in background noise. Current neurobiological frameworks largely agree on the importance of bilateral temporal cortex for processing auditory speech, with the addition of left frontal cortex for more complex linguistic structures (such as sentences). Although visual cortex is clearly important for audiovisual speech processing, there is continued debate about where and how auditory and visual signals are integrated. Studies offer evidence supporting multisensory roles for the posterior superior temporal sulcus, auditory cortex, and motor cortex. Rather than a single integration mechanism, it may be that visual and auditory inputs are combined in different ways depending on the type of information being processed. Importantly, core speech regions are not always sufficient for successfully understanding spoken language. Increased linguistic complexity or acoustic challenge forces listeners to recruit additional neural systems. In many cases compensatory activity is seen in executive and attention systems, such as the cingulo-opercular or frontoparietal networks. These patterns of increased activity appear to depend on the auditory and cognitive abilities of individual listeners, indicating a systems-level balance between neural systems that dynamically adjusts to the acoustic properties of the speech and current task demands. Speech perception is thus a shining example of flexible neural processing and behavioral stability.


2019 ◽  
Vol 8 (3) ◽  
pp. 8597-8600

This paper presents a brief survey of accent detection, accent identification, and accent classification. Speech processing has become one of the more popular and inspiring areas of signal processing in recent years, because speech is one of the most natural forms of human communication. However, speech signals intrinsically show many variations even without background noise: two different people can produce different spectrograms when saying the same sentence. Dialect or accent is one of the most important factors, besides gender, that can influence automatic speech recognition (ASR) performance. Many studies show that dialect or accent can significantly affect speech system performance. Various methods have been used to increase the accuracy of ASR through accent detection, accent identification, and accent classification; fused i-vector and phonotactic approaches are among the latest techniques to show a significant degree of accuracy. The purpose of this paper is to briefly survey accent detection, accent identification, and accent classification, and to discuss the major improvements made over roughly the past decade of research.
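
For illustration only, a minimal utterance-level accent classifier might summarize each recording by its mean MFCC vector over time and fit a linear model. The file names and labels below are hypothetical, and real systems of the kind surveyed here (fused i-vector, phonotactic) are far richer than this sketch.

import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def utterance_features(path, n_mfcc=13):
    # Load audio and summarize it as the mean MFCC vector over time
    signal, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

paths = ["accent_a_001.wav", "accent_b_001.wav"]  # hypothetical recordings
labels = [0, 1]                                   # 0 = accent A, 1 = accent B
X = np.stack([utterance_features(p) for p in paths])
clf = LogisticRegression(max_iter=1000).fit(X, labels)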


2009 ◽  
Vol 23 (2) ◽  
pp. 63-76 ◽  
Author(s):  
Silke Paulmann ◽  
Sarah Jessen ◽  
Sonja A. Kotz

The multimodal nature of human communication has been well established. Yet few empirical studies have systematically examined the widely held belief that multimodal perception is facilitated in comparison to unimodal or bimodal perception. In the current experiment we first explored the processing of unimodally presented facial expressions. Auditory (prosodic and/or lexical-semantic) information was then presented together with the visual information to investigate the processing of bimodal (facial and prosodic cues) and multimodal (facial, lexical, and prosodic cues) human communication. Participants engaged in an identity identification task while event-related potentials (ERPs) were recorded to examine early processing mechanisms as reflected in the P200 and N300 components. While the former component has repeatedly been linked to the processing of physical stimulus properties, the latter has been linked to more evaluative, “meaning-related” processing. A direct relationship between P200 and N300 amplitude and the number of information channels present was found: the multimodal condition elicited the smallest amplitude in the P200 and N300 components, followed by a larger amplitude in each component for the bimodal condition, with the largest amplitude observed for the unimodal condition. These data suggest that multimodal information induces clear facilitation in comparison to unimodal or bimodal information. The advantage of multimodal perception, as reflected in the P200 and N300 components, may thus reflect one of the mechanisms allowing for fast and accurate information processing in human communication.


2009 ◽  
Vol 14 (1) ◽  
pp. 78-89 ◽  
Author(s):  
Kenneth Hugdahl ◽  
René Westerhausen

The present paper is based on a talk on hemispheric asymmetry given by Kenneth Hugdahl at the Xth European Congress of Psychology, Prague, July 2007. Here, we propose that hemispheric asymmetry evolved because of a left-hemisphere specialization for speech processing. The evolution of speech and the need for air-based communication necessitated a division of labor between the hemispheres in order to avoid having duplicate copies in both hemispheres that would increase processing redundancy. It is argued that the neuronal basis of this division of labor is the structural asymmetry observed in the peri-Sylvian region in the posterior part of the temporal lobe, with the left planum temporale area larger than the right. This is the only example where a structural, or anatomical, asymmetry matches a corresponding functional asymmetry. The increase in gray matter volume in the left planum temporale corresponds to a functional asymmetry of speech processing, as indexed by behavioral, dichotic listening, and functional neuroimaging studies. The functional anatomy of the corpus callosum also supports such a view, with regional specificity of information transfer between the hemispheres.


1988 ◽  
Vol 33 (10) ◽  
pp. 920-921
Author(s):  
L. Kristine Pond

2012 ◽  
Author(s):  
Christine M. Szostak ◽  
Mark A. Pitt ◽  
Laura C. Dilley
