Speech Recognizers
Recently Published Documents

Total documents: 96 (five years: 4)
H-index: 10 (five years: 0)

2021, Vol 176, pp. 114860
Author(s): Alejandro Coucheiro-Limeres, Javier Ferreiros-López, Fernando Fernández-Martínez, Ricardo Córdoba

2021, Vol 5 (EICS), pp. 1-24
Author(s): Xinlei Zhang, Takashi Miyaki, Jun Rekimoto

Conversational agents are widely used in many situations, especially for speech tutoring. However, their contents and functions are often pre-defined and not customizable by people without technical backgrounds, which significantly limits their flexibility and usability. In addition, conventional agents often cannot provide feedback in the middle of a training session because they lack technical approaches for evaluating users' speech dynamically. We propose JustSpeak: automated and interactive speech tutoring agents with various configurable feedback mechanisms, using any speech recording with its transcription as the template for speech training. In JustSpeak, we developed an automated procedure to generate customized tutoring agents from user-provided templates. Moreover, we created a set of methods to dynamically synchronize a speech recognizer's behavior with the agent's tutoring progress, making it possible to dynamically detect various speech mistakes such as being stuck, mispronunciation, and rhythm deviations. Furthermore, we identified design primitives in JustSpeak for creating novel feedback mechanisms, such as adaptive playback, follow-on training, and passive adaptation. These can be combined to create customized tutoring agents, which we demonstrate with an example for language learning. We believe JustSpeak can create more personalized speech learning opportunities by enabling tutoring agents that are customizable, always available, and easy to use.
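
The dynamic synchronization JustSpeak describes can be pictured as aligning a streaming recognizer's partial hypotheses against the template transcript. The sketch below shows one such alignment loop in Python; the `partial_results` generator, the fuzzy-match threshold, and the stuck timeout are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' implementation) of synchronizing a
# streaming recognizer's partial hypotheses with a tutoring template.
# Assumes a hypothetical generator yielding (timestamp_sec, hypothesis_text).
import difflib

STUCK_TIMEOUT = 3.0  # seconds without progress before flagging "stuck" (assumed)

def track_progress(template_words, partial_results):
    """Align each partial hypothesis against the template and yield tutoring events."""
    matched = 0            # index of the next template word still to be spoken
    last_progress = None   # timestamp of the last template-word advance
    for timestamp, hypothesis in partial_results:
        if last_progress is None:
            last_progress = timestamp
        hyp_words = hypothesis.lower().split()
        # Greedily advance past template words the hypothesis already covers.
        while matched < len(template_words):
            target = template_words[matched].lower()
            # Fuzzy match tolerates recognizer spelling variants.
            if any(difflib.SequenceMatcher(None, target, w).ratio() > 0.8
                   for w in hyp_words):
                matched += 1
                last_progress = timestamp
            else:
                break
        if matched == len(template_words):
            yield ("finished", matched)
            return
        if timestamp - last_progress > STUCK_TIMEOUT:
            yield ("stuck", matched)  # no progress for a while: learner may be stuck
        elif len(hyp_words) > matched:
            # More words recognized than matched is a rough mispronunciation cue.
            yield ("possible_mispronunciation", matched)
```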


Author(s): D.N.V.S.L.S. Indira et al.

Recent developments in audio-visual emotion recognition (AVER) have identified the importance of integrating visual components into the speech recognition process to improve robustness. Visual characteristics have strong potential to boost the accuracy of current speech recognition techniques and have become increasingly important when modelling speech recognizers. CNNs work very well with images, and an audio file can be converted into an image-like representation such as a spectrogram, whose frequency content exposes hidden structure. This paper presents a method for emotion recognition using spectrograms and a two-dimensional CNN (CNN-2D). Spectrograms formed from speech signals serve as the CNN-2D input. The proposed model consists of three kinds of CNN layers (convolutional layers, pooling layers, and fully connected layers); it extracts discriminative characteristics from the spectrogram representations and estimates scores for the seven emotions. The article compares the output with existing SER approaches that apply a CNN to audio files directly; accuracy improves by 6.5% when CNN-2D is used.
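
As a rough illustration of the pipeline this abstract describes (audio converted to a spectrogram "image", then classified by convolution, pooling, and fully connected layers over seven emotions), the Python sketch below uses librosa and PyTorch; the layer sizes and mel parameters are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a spectrogram + CNN-2D emotion classifier.
import librosa
import numpy as np
import torch
import torch.nn as nn

def audio_to_spectrogram(path, n_mels=64, duration=3.0, sr=16000):
    """Load a clip and convert it to a log-mel spectrogram 'image'."""
    y, sr = librosa.load(path, sr=sr, duration=duration)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, frames)

class EmotionCNN(nn.Module):
    """Conv -> pool blocks followed by a fully connected head for 7 emotions."""
    def __init__(self, num_emotions=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_emotions)
        )

    def forward(self, x):  # x: (batch, 1, n_mels, frames)
        return self.classifier(self.features(x))

# Usage: wrap one spectrogram as a (1, 1, H, W) tensor to get emotion logits.
# spec = audio_to_spectrogram("clip.wav")  # hypothetical file name
# logits = EmotionCNN()(torch.tensor(spec)[None, None].float())
```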


2020, Vol 34 (04), pp. 6917-6924
Author(s): Ya Zhao, Rui Xu, Xinchao Wang, Peng Hou, Haihong Tang, ...

Lip reading has witnessed unparalleled development in recent years thanks to deep learning and the availability of large-scale datasets. Despite the encouraging results achieved, the performance of lip reading unfortunately remains inferior to that of its counterpart, speech recognition, due to the ambiguous nature of lip actuations, which makes it challenging to extract discriminant features from lip movement videos. In this paper, we propose a new method, termed Lip by Speech (LIBS), whose goal is to strengthen lip reading by learning from speech recognizers. The rationale behind our approach is that the features extracted by speech recognizers may provide complementary and discriminant clues that are difficult to obtain from the subtle movements of the lips, and consequently facilitate the training of lip readers. This is achieved, specifically, by distilling multi-granularity knowledge from speech recognizers to lip readers. To conduct this cross-modal knowledge distillation, we utilize an efficacious alignment scheme to handle the inconsistent lengths of the audios and videos, as well as an innovative filtering strategy to refine the speech recognizer's prediction. The proposed method achieves new state-of-the-art performance on the CMLR and LRS2 datasets, outperforming the baseline by a margin of 7.66% and 2.75% in character error rate, respectively.
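
The cross-modal distillation idea can be sketched as training the lip reader (student) to regress the speech recognizer's (teacher's) features at more than one granularity. The Python sketch below approximates the audio/video length mismatch with simple linear interpolation; LIBS uses its own alignment and filtering schemes, so this is an assumption for illustration only.

```python
# Minimal sketch of cross-modal feature distillation with a length mismatch.
import torch
import torch.nn.functional as F

def sequence_distill_loss(video_feats, audio_feats):
    """
    video_feats: (batch, T_v, D) lip-reader (student) features
    audio_feats: (batch, T_a, D) speech-recognizer (teacher) features, frozen
    """
    # Resample teacher features along time so T_a matches T_v.
    teacher = F.interpolate(
        audio_feats.transpose(1, 2),   # (batch, D, T_a)
        size=video_feats.size(1),
        mode="linear", align_corners=False,
    ).transpose(1, 2)                  # (batch, T_v, D)
    # Frame-level distillation: match student features to the detached teacher.
    frame_loss = F.mse_loss(video_feats, teacher.detach())
    # Sequence-level distillation: also match mean-pooled utterance embeddings.
    seq_loss = F.mse_loss(video_feats.mean(dim=1),
                          audio_feats.detach().mean(dim=1))
    return frame_loss + seq_loss
```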


2019, Vol 8 (4), pp. 7186-7189

Dysarthria is a motor speech disorder caused by damage to the nervous system, for example to parts of the brain such as the cerebellum. The damage weakens the muscles used for speech, producing mumbled, slurred, or slow speech that both humans and machines find difficult to understand. Automatic speech recognizers designed for intelligible speech perform poorly on dysarthric speech. This paper focuses on transforming dysarthric speech to enhance its intelligibility. Formant tracking, pitch and energy estimation, and durational cues extracted from dysarthric speech allow these trajectories to be modified so that they more closely approximate the desired intelligible target speech. The transformation is performed by formant re-synthesis, pitch change, and duration morphing. The results indicate that the pitch and duration transformation steps enhance the intelligibility of dysarthric speech, making it easier for humans and machines to understand.
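
The pitch-change and duration-morphing steps can be approximated with off-the-shelf signal processing. The Python sketch below uses librosa's time stretching and pitch shifting, with illustrative target values, and omits the formant re-synthesis stage the paper describes.

```python
# Minimal sketch of pitch and duration modification for slow, slurred speech.
import librosa
import soundfile as sf

def morph_speech(in_path, out_path, rate=1.25, semitones=2.0, sr=16000):
    """Speed up slow dysarthric speech and shift its pitch toward a target."""
    y, sr = librosa.load(in_path, sr=sr)
    y = librosa.effects.time_stretch(y, rate=rate)                # duration morphing
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=semitones)  # pitch change
    sf.write(out_path, y, sr)

# morph_speech("dysarthric.wav", "enhanced.wav")  # hypothetical file names
```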


2018
Author(s): Soma Khan, Madhab Pal, Joyanta Basu, Milton Samirakshma Bepari, Rajib Roy
