scholarly journals Improving Hybrid CTC/Attention Architecture with Time-Restricted Self-Attention CTC for End-to-End Speech Recognition

2019 ◽  
Vol 9 (21) ◽  
pp. 4639 ◽  
Author(s):  
Long Wu ◽  
Ta Li ◽  
Li Wang ◽  
Yonghong Yan

As demonstrated in hybrid connectionist temporal classification (CTC)/Attention architecture, joint training with a CTC objective is very effective to solve the misalignment problem existing in the attention-based end-to-end automatic speech recognition (ASR) framework. However, the CTC output relies only on the current input, which leads to the hard alignment issue. To address this problem, this paper proposes the time-restricted attention CTC/Attention architecture, which integrates an attention mechanism with the CTC branch. “Time-restricted” means that the attention mechanism is conducted on a limited window of frames to the left and right. In this study, we first explore time-restricted location-aware attention CTC/Attention, establishing the proper time-restricted attention window size. Inspired by the success of self-attention in machine translation, we further introduce the time-restricted self-attention CTC/Attention that can better model the long-range dependencies among the frames. Experiments with wall street journal (WSJ), augmented multiparty interaction (AMI), and switchboard (SWBD) tasks demonstrate the effectiveness of the proposed time-restricted self-attention CTC/Attention. Finally, to explore the robustness of this method to noise and reverberation, we join a train neural beamformer frontend with the time-restricted attention CTC/Attention ASR backend in the CHIME-4 dataset. The reduction of word error rate (WER) and the increase of perceptual evaluation of speech quality (PESQ) approve the effectiveness of this framework.

Author(s):  
Chu-Xiong Qin ◽  
Wen-Lin Zhang ◽  
Dan Qu

Abstract A method called joint connectionist temporal classification (CTC)-attention-based speech recognition has recently received increasing focus and has achieved impressive performance. A hybrid end-to-end architecture that adds an extra CTC loss to the attention-based model could force extra restrictions on alignments. To explore better the end-to-end models, we propose improvements to the feature extraction and attention mechanism. First, we introduce a joint model trained with nonnegative matrix factorization (NMF)-based high-level features. Then, we put forward a hybrid attention mechanism by incorporating multi-head attentions and calculating attention scores over multi-level outputs. Experiments on TIMIT indicate that the new method achieves state-of-the-art performance with our best model. Experiments on WSJ show that our method exhibits a word error rate (WER) that is only 0.2% worse in absolute value than the best referenced method, which is trained on a much larger dataset, and it beats all present end-to-end methods. Further experiments on LibriSpeech show that our method is also comparable to the state-of-the-art end-to-end system in WER.


Sensors ◽  
2021 ◽  
Vol 21 (9) ◽  
pp. 3063
Author(s):  
Aleksandr Laptev ◽  
Andrei Andrusenko ◽  
Ivan Podluzhny ◽  
Anton Mitrofanov ◽  
Ivan Medennikov ◽  
...  

With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to a direct device has become crucial. For on-device speech recognition tasks, researchers and industry prefer end-to-end ASR systems as they can be made resource-efficient while maintaining a higher quality compared to hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which is mainly handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method non-deterministically tokenizes utterances to extend the token’s contexts and to regularize their distribution for the model’s recognition of unseen words. It also reduces the need for optimal subword vocabulary size search. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative word error rate (WER) and 25% relative F-score) at no additional computational cost. Owing to the BPE-dropout use, our monolingual Turkish Conformer has achieved a competitive result with 22.2% character error rate (CER) and 38.9% WER, which is close to the best published multilingual system.


2021 ◽  
Author(s):  
Zhong Meng ◽  
Yu Wu ◽  
Naoyuki Kanda ◽  
Liang Lu ◽  
Xie Chen ◽  
...  

Author(s):  
Nikita Markovnikov ◽  
Irina Kipyatkova

Problem: Classical systems of automatic speech recognition are traditionally built using an acoustic model based on hidden Markovmodels and a statistical language model. Such systems demonstrate high recognition accuracy, but consist of several independentcomplex parts, which can cause problems when building models. Recently, an end-to-end recognition method has been spread, usingdeep artificial neural networks. This approach makes it easy to implement models using just one neural network. End-to-end modelsoften demonstrate better performance in terms of speed and accuracy of speech recognition. Purpose: Implementation of end-toendmodels for the recognition of continuous Russian speech, their adjustment and comparison with hybrid base models in terms ofrecognition accuracy and computational characteristics, such as the speed of learning and decoding. Methods: Creating an encoderdecodermodel of speech recognition using an attention mechanism; applying techniques of stabilization and regularization of neuralnetworks; augmentation of data for training; using parts of words as an output of a neural network. Results: An encoder-decodermodel was obtained using an attention mechanism for recognizing continuous Russian speech without extracting features or usinga language model. As elements of the output sequence, we used parts of words from the training set. The resulting model could notsurpass the basic hybrid models, but surpassed the other baseline end-to-end models, both in recognition accuracy and in decoding/learning speed. The word recognition error was 24.17% and the decoding speed was 0.3 of the real time, which is 6% faster than thebaseline end-to-end model and 46% faster than the basic hybrid model. We showed that end-to-end models could work without languagemodels for the Russian language, while demonstrating a higher decoding speed than hybrid models. The resulting model was trained onraw data without extracting any features. We found that for the Russian language the hybrid type of an attention mechanism gives thebest result compared to location-based or context-based attention mechanisms. Practical relevance: The resulting models require lessmemory and less speech decoding time than the traditional hybrid models. That fact can allow them to be used locally on mobile deviceswithout using calculations on remote servers.


Author(s):  
Jinxi Guo ◽  
Gautam Tiwari ◽  
Jasha Droppo ◽  
Maarten Van Segbroeck ◽  
Che-Wei Huang ◽  
...  

2019 ◽  
Vol 8 (2S11) ◽  
pp. 2350-2352

the dissimilarity in recognizing the word sequence and their ground truth in different channels can be absorbed by implementing Automatic Speech Recognition which is the standard evaluation metric and is encountered with the phenomena of Word Error Rate for various measures. In the model of 1ch, the track is trained without any preprocessing and study on multichannel end-to-end Automatic Speech Recognition envisaged that the function can be integrated into (Deep Neural network) – based system and lead to multiple experimental results. More so, when the Word Error Rate (WER) is not directly differentiable, it is pertinent to adopt Encoder – Decoder gradient objective function which has been clear in CHiME-4 system. In this study, we examine that the sequence level evaluation metric is a fair choice for optimizing Encoder – Decoder model for which many training algorithms is designed to reduce sequence level error. The study incorporates the scoring of multiple hypotheses in decoding stage for improving the decoding result to optimum. By this, the mismatch between the objectives is resulted in a feasible form to the maxim. Hence, the study finds the result of voice recognition which is most effective for adaptation.


Author(s):  
Lujun Li ◽  
Yikai Kang ◽  
Yuchen Shi ◽  
Ludwig Kürzinger ◽  
Tobias Watzel ◽  
...  

AbstractLately, the self-attention mechanism has marked a new milestone in the field of automatic speech recognition (ASR). Nevertheless, its performance is susceptible to environmental intrusions as the system predicts the next output symbol depending on the full input sequence and the previous predictions. A popular solution for this problem is adding an independent speech enhancement module as the front-end. Nonetheless, due to being trained separately from the ASR module, the independent enhancement front-end falls into the sub-optimum easily. Besides, the handcrafted loss function of the enhancement module tends to introduce unseen distortions, which even degrade the ASR performance. Inspired by the extensive applications of the generative adversarial networks (GANs) in speech enhancement and ASR tasks, we propose an adversarial joint training framework with the self-attention mechanism to boost the noise robustness of the ASR system. Generally, it consists of a self-attention speech enhancement GAN and a self-attention end-to-end ASR model. There are two advantages which are worth noting in this proposed framework. One is that it benefits from the advancement of both self-attention mechanism and GANs, while the other is that the discriminator of GAN plays the role of the global discriminant network in the stage of the adversarial joint training, which guides the enhancement front-end to capture more compatible structures for the subsequent ASR module and thereby offsets the limitation of the separate training and handcrafted loss functions. With the adversarial joint optimization, the proposed framework is expected to learn more robust representations suitable for the ASR task. We execute systematic experiments on the corpus AISHELL-1, and the experimental results show that on the artificial noisy test set, the proposed framework achieves the relative improvements of 66% compared to the ASR model trained by clean data solely, 35.1% compared to the speech enhancement and ASR scheme without joint training, and 5.3% compared to multi-condition training.


Author(s):  
Jia-nan Chen ◽  
Shuang Gao ◽  
Han-zhe Sun ◽  
Xiao-hui Liu ◽  
Zi-ning Wang ◽  
...  

2021 ◽  
pp. 1-13
Author(s):  
Hamzah A. Alsayadi ◽  
Abdelaziz A. Abdelhamid ◽  
Islam Hegazy ◽  
Zaki T. Fayed

Arabic language has a set of sound letters called diacritics, these diacritics play an essential role in the meaning of words and their articulations. The change in some diacritics leads to a change in the context of the sentence. However, the existence of these letters in the corpus transcription affects the accuracy of speech recognition. In this paper, we investigate the effect of diactrics on the Arabic speech recognition based end-to-end deep learning. The applied end-to-end approach includes CNN-LSTM and attention-based technique presented in the state-of-the-art framework namely, Espresso using Pytorch. In addition, and to the best of our knowledge, the approach of CNN-LSTM with attention-based has not been used in the task of Arabic Automatic speech recognition (ASR). To fill this gap, this paper proposes a new approach based on CNN-LSTM with attention based method for Arabic ASR. The language model in this approach is trained using RNN-LM and LSTM-LM and based on nondiacritized transcription of the speech corpus. The Standard Arabic Single Speaker Corpus (SASSC), after omitting the diacritics, is used to train and test the deep learning model. Experimental results show that the removal of diacritics decreased out-of-vocabulary and perplexity of the language model. In addition, the word error rate (WER) is significantly improved when compared to diacritized data. The achieved average reduction in WER is 13.52%.


Sign in / Sign up

Export Citation Format

Share Document