Speech Recognizers
Recently Published Documents

Total documents: 96 (five years: 4)
H-index: 10 (five years: 0)

2021, Vol 176, pp. 114860
Author(s): Alejandro Coucheiro-Limeres, Javier Ferreiros-López, Fernando Fernández-Martínez, Ricardo Córdoba

2021, Vol 5 (EICS), pp. 1-24
Author(s): Xinlei Zhang, Takashi Miyaki, Jun Rekimoto

Conversational agents are widely used in many situations, especially for speech tutoring. However, their contents and functions are often pre-defined and not customizable by people without technical backgrounds, which significantly limits their flexibility and usability. In addition, conventional agents often cannot provide feedback in the middle of a training session because they lack technical approaches for evaluating users' speech dynamically. We propose JustSpeak: automated and interactive speech tutoring agents with various configurable feedback mechanisms, using any speech recording with its transcription as the template for speech training. In JustSpeak, we developed an automated procedure to generate customized tutoring agents from user-provided templates. Moreover, we created a set of methods to dynamically synchronize a speech recognizer's behavior with the agent's tutoring progress, making it possible to dynamically detect various speech mistakes such as being stuck, mispronunciation, and rhythm deviations. Furthermore, we identified design primitives in JustSpeak for creating novel feedback mechanisms, such as adaptive playback, follow-on training, and passive adaptation. These can be combined to create customized tutoring agents, which we demonstrate with an example for language learning. We believe JustSpeak can create more personalized speech learning opportunities by enabling tutoring agents that are customizable, always available, and easy to use.
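
The dynamic synchronization JustSpeak describes can be pictured as aligning a streaming recognizer's partial hypotheses against the template transcript. The sketch below shows one such alignment loop in Python; the `partial_results` generator, the fuzzy-match threshold, and the stuck timeout are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' implementation) of synchronizing a
# streaming recognizer's partial hypotheses with a tutoring template.
# Assumes a hypothetical generator yielding (timestamp_sec, hypothesis_text).
import difflib

STUCK_TIMEOUT = 3.0  # seconds without progress before flagging "stuck" (assumed)

def track_progress(template_words, partial_results):
    """Align each partial hypothesis against the template and yield tutoring events."""
    matched = 0            # index of the next template word still to be spoken
    last_progress = None   # timestamp of the last template-word advance
    for timestamp, hypothesis in partial_results:
        if last_progress is None:
            last_progress = timestamp
        hyp_words = hypothesis.lower().split()
        # Greedily advance past template words the hypothesis already covers.
        while matched < len(template_words):
            target = template_words[matched].lower()
            # Fuzzy match tolerates recognizer spelling variants.
            if any(difflib.SequenceMatcher(None, target, w).ratio() > 0.8
                   for w in hyp_words):
                matched += 1
                last_progress = timestamp
            else:
                break
        if matched == len(template_words):
            yield ("finished", matched)
            return
        if timestamp - last_progress > STUCK_TIMEOUT:
            yield ("stuck", matched)  # no progress for a while: learner may be stuck
        elif len(hyp_words) > matched:
            # More words recognized than matched is a rough mispronunciation cue.
            yield ("possible_mispronunciation", matched)
```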


Author(s): D.N.V.S.L.S. Indira et al.

Recent developments in audio-visual emotion recognition (AVER) have identified the importance of integrating visual components into the speech recognition process to improve robustness. Visual characteristics have strong potential to boost the accuracy of current speech recognition techniques and have become increasingly important when modelling speech recognizers. CNNs work very well with images, and an audio file can be converted into an image-like representation such as a spectrogram, whose frequency content exposes hidden structure. This paper presents a method for emotion recognition using spectrograms and a two-dimensional CNN (CNN-2D). Spectrograms formed from speech signals serve as the CNN-2D input. The proposed model consists of three kinds of CNN layers (convolutional layers, pooling layers, and fully connected layers); it extracts discriminative characteristics from the spectrogram representations and estimates scores for the seven emotions. The article compares the output with existing SER approaches that apply a CNN to audio files directly; accuracy improves by 6.5% when CNN-2D is used.
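
As a rough illustration of the pipeline this abstract describes (audio converted to a spectrogram "image", then classified by convolution, pooling, and fully connected layers over seven emotions), the Python sketch below uses librosa and PyTorch; the layer sizes and mel parameters are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a spectrogram + CNN-2D emotion classifier.
import librosa
import numpy as np
import torch
import torch.nn as nn

def audio_to_spectrogram(path, n_mels=64, duration=3.0, sr=16000):
    """Load a clip and convert it to a log-mel spectrogram 'image'."""
    y, sr = librosa.load(path, sr=sr, duration=duration)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, frames)

class EmotionCNN(nn.Module):
    """Conv -> pool blocks followed by a fully connected head for 7 emotions."""
    def __init__(self, num_emotions=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_emotions)
        )

    def forward(self, x):  # x: (batch, 1, n_mels, frames)
        return self.classifier(self.features(x))

# Usage: wrap one spectrogram as a (1, 1, H, W) tensor to get emotion logits.
# spec = audio_to_spectrogram("clip.wav")  # hypothetical file name
# logits = EmotionCNN()(torch.tensor(spec)[None, None].float())
```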


2020, Vol 34 (04), pp. 6917-6924
Author(s): Ya Zhao, Rui Xu, Xinchao Wang, Peng Hou, Haihong Tang, ...

Lip reading has witnessed unparalleled development in recent years thanks to deep learning and the availability of large-scale datasets. Despite the encouraging results achieved, the performance of lip reading unfortunately remains inferior to that of its counterpart, speech recognition, due to the ambiguous nature of lip actuations, which makes it challenging to extract discriminant features from lip movement videos. In this paper, we propose a new method, termed Lip by Speech (LIBS), whose goal is to strengthen lip reading by learning from speech recognizers. The rationale behind our approach is that the features extracted by speech recognizers may provide complementary and discriminant clues that are difficult to obtain from the subtle movements of the lips, and consequently facilitate the training of lip readers. This is achieved, specifically, by distilling multi-granularity knowledge from speech recognizers to lip readers. To conduct this cross-modal knowledge distillation, we utilize an efficacious alignment scheme to handle the inconsistent lengths of the audios and videos, as well as an innovative filtering strategy to refine the speech recognizer's prediction. The proposed method achieves new state-of-the-art performance on the CMLR and LRS2 datasets, outperforming the baseline by a margin of 7.66% and 2.75% in character error rate, respectively.
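
The cross-modal distillation idea can be sketched as training the lip reader (student) to regress the speech recognizer's (teacher's) features at more than one granularity. The Python sketch below approximates the audio/video length mismatch with simple linear interpolation; LIBS uses its own alignment and filtering schemes, so this is an assumption for illustration only.

```python
# Minimal sketch of cross-modal feature distillation with a length mismatch.
import torch
import torch.nn.functional as F

def sequence_distill_loss(video_feats, audio_feats):
    """
    video_feats: (batch, T_v, D) lip-reader (student) features
    audio_feats: (batch, T_a, D) speech-recognizer (teacher) features, frozen
    """
    # Resample teacher features along time so T_a matches T_v.
    teacher = F.interpolate(
        audio_feats.transpose(1, 2),   # (batch, D, T_a)
        size=video_feats.size(1),
        mode="linear", align_corners=False,
    ).transpose(1, 2)                  # (batch, T_v, D)
    # Frame-level distillation: match student features to the detached teacher.
    frame_loss = F.mse_loss(video_feats, teacher.detach())
    # Sequence-level distillation: also match mean-pooled utterance embeddings.
    seq_loss = F.mse_loss(video_feats.mean(dim=1),
                          audio_feats.detach().mean(dim=1))
    return frame_loss + seq_loss
```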


2019, Vol 8 (4), pp. 7186-7189

Dysarthria is a motor speech disorder caused by damage to the nervous system, for example to parts of the brain such as the cerebellum. The damage weakens the muscles used for speech, producing mumbled, slurred, or slow speech that both humans and machines find difficult to understand. Automatic speech recognizers designed for intelligible speech perform poorly on dysarthric speech. This paper focuses on transforming dysarthric speech to enhance its intelligibility. Formant tracking, pitch and energy estimation, and durational cues extracted from dysarthric speech allow these trajectories to be modified so that they more closely approximate the desired intelligible target speech. The transformation is performed by formant re-synthesis, pitch change, and duration morphing. The results indicate that the pitch and duration transformation steps enhance the intelligibility of dysarthric speech, making it easier for humans and machines to understand.
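
The pitch-change and duration-morphing steps can be approximated with off-the-shelf signal processing. The Python sketch below uses librosa's time stretching and pitch shifting, with illustrative target values, and omits the formant re-synthesis stage the paper describes.

```python
# Minimal sketch of pitch and duration modification for slow, slurred speech.
import librosa
import soundfile as sf

def morph_speech(in_path, out_path, rate=1.25, semitones=2.0, sr=16000):
    """Speed up slow dysarthric speech and shift its pitch toward a target."""
    y, sr = librosa.load(in_path, sr=sr)
    y = librosa.effects.time_stretch(y, rate=rate)                # duration morphing
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=semitones)  # pitch change
    sf.write(out_path, y, sr)

# morph_speech("dysarthric.wav", "enhanced.wav")  # hypothetical file names
```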


2018
Author(s): Soma Khan, Madhab Pal, Joyanta Basu, Milton Samirakshma Bepari, Rajib Roy
