Multi-Level Processing in Human Speech Recognition

1989
Author(s):
Peter C. Gordon

2018
Vol 10 (11)
pp. 4615-4624
Author(s):
Shubhanshi Singhal
Vishal Passricha
Pooja Sharma
Rajesh Kumar Aggarwal

Author(s):  
Philippe Morin
Jean-Paul Haton
Jean-Marie Pierrel
Guenther Ruske
Walter Weigel

In the framework of man-machine communication, oral dialogue occupies a particular place, since human speech offers several advantages whether used alone or in multimedia interfaces. The last decade has witnessed a proliferation of research into speech recognition and understanding, but few systems have been designed to manage and understand an actual man-machine dialogue. The PARTNER system described in this paper proposes a solution for task-oriented dialogue using artificial languages. A description of the essential characteristics of dialogue systems is followed by a presentation of the architecture and principles of the PARTNER system. Finally, we present the most recent results obtained in the oral management of electronic mail in French and German.


2020
Vol 287 (1941)
pp. 20202531
Author(s):
Julia Fischer
Franziska Wegdell
Franziska Trede
Federica Dal Pesco
Kurt Hammerschmidt

The extent to which nonhuman primate vocalizations are amenable to modification through experience is relevant for understanding the substrate from which human speech evolved. We examined the vocal behaviour of Guinea baboons, Papio papio, ranging in the Niokolo Koba National Park in Senegal. Guinea baboons live in a multi-level society, with units nested within parties nested within gangs. We investigated whether the acoustic structure of grunts of 27 male baboons from two gangs varied with party/gang membership and genetic relatedness. Males in this species are philopatric, resulting in increased male relatedness within gangs and parties. Grunts of males that belonged to the same social level were more similar than those of males from different social levels (N = 351 dyads for comparisons within and between gangs, and N = 169 dyads within and between parties), but the effect sizes were small. Acoustic similarity did not, however, correlate with genetic relatedness, suggesting that higher rates of social interaction, rather than genetic relatedness, promote the observed vocal convergence. We consider this convergence a result of sensory–motor integration and suggest that it constitutes an implicit form of vocal learning shared with humans, in contrast to the goal-directed and intentional explicit form of vocal learning unique to human speech acquisition.
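The dyadic design described above lends itself to a matrix-correlation check. The sketch below is purely illustrative, not the authors' analysis pipeline (the abstract does not specify their statistical models): it assumes two symmetric per-dyad matrices, acoustic_sim and relatedness, and runs a Mantel-style permutation test of their association in Python.

import numpy as np

def mantel_test(acoustic_sim, relatedness, n_perm=10000, seed=0):
    """Correlate the upper triangles of two symmetric dyad matrices,
    assessing significance by jointly permuting rows and columns."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(acoustic_sim, k=1)  # one entry per dyad
    observed = np.corrcoef(acoustic_sim[iu], relatedness[iu])[0, 1]
    n = acoustic_sim.shape[0]
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(n)                # shuffle individual identities
        shuffled = relatedness[np.ix_(p, p)]  # preserve matrix structure
        r = np.corrcoef(acoustic_sim[iu], shuffled[iu])[0, 1]
        if abs(r) >= abs(observed):
            hits += 1
    return observed, hits / n_perm  # correlation and two-sided p-value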


Author(s):  
Chu-Xiong Qin ◽  
Wen-Lin Zhang ◽  
Dan Qu

Abstract
A method called joint connectionist temporal classification (CTC)-attention-based speech recognition has recently received increasing attention and has achieved impressive performance. In this hybrid end-to-end architecture, an auxiliary CTC loss added to the attention-based model imposes extra restrictions on the alignments. To explore end-to-end models further, we propose improvements to the feature extraction and the attention mechanism. First, we introduce a joint model trained with high-level features derived from nonnegative matrix factorization (NMF). Then, we put forward a hybrid attention mechanism that incorporates multi-head attention and computes attention scores over multi-level outputs. Experiments on TIMIT indicate that the new method achieves state-of-the-art performance with our best model. Experiments on WSJ show that our method yields a word error rate (WER) only 0.2% worse in absolute terms than the best reference method, which is trained on a much larger dataset, and that it outperforms all existing end-to-end methods. Further experiments on LibriSpeech show that our method is also comparable in WER to the state-of-the-art end-to-end system.
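As a concrete illustration of the hybrid objective this abstract describes, the following PyTorch sketch interpolates a CTC loss with the attention decoder's cross-entropy loss. The class name, tensor shapes, and the interpolation weight (ctc_weight = 0.3) are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class JointCTCAttentionLoss(nn.Module):
    """Interpolates a CTC loss with the attention decoder's
    cross-entropy loss, as in hybrid CTC-attention training."""
    def __init__(self, blank_id, pad_id, ctc_weight=0.3):
        super().__init__()
        self.ctc_weight = ctc_weight  # assumed value; tuned per dataset
        self.ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        self.ce = nn.CrossEntropyLoss(ignore_index=pad_id)

    def forward(self, ctc_log_probs, decoder_logits, targets,
                input_lengths, target_lengths):
        # ctc_log_probs: (T, B, V) log-softmax outputs of the encoder head
        # decoder_logits: (B, L, V) attention-decoder outputs
        # targets: (B, L) padded label sequences
        loss_ctc = self.ctc(ctc_log_probs, targets,
                            input_lengths, target_lengths)
        loss_att = self.ce(decoder_logits.transpose(1, 2), targets)
        # The CTC term constrains the attention alignments to be roughly
        # monotonic; ctc_weight trades off the two objectives.
        return self.ctc_weight * loss_ctc + (1.0 - self.ctc_weight) * loss_att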

