Using Privacy-Transformed Speech in the Automatic Speech Recognition Acoustic Model Training

Author(s):  
Askars Salimbajevs

Automatic Speech Recognition (ASR) requires huge amounts of real user speech data to reach state-of-the-art performance. However, speech data conveys sensitive speaker attributes, such as identity, that can be inferred and exploited for malicious purposes. Therefore, there is growing interest in collecting anonymized speech data that has been processed by a voice conversion method. In this paper, we evaluate one such voice conversion method on Latvian speech data and investigate whether privacy-transformed data can be used to improve ASR acoustic models. Results show the effectiveness of voice conversion against state-of-the-art speaker verification models on Latvian speech, as well as the effectiveness of using privacy-transformed data in ASR training.
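Privacy evaluations of this kind typically report speaker-verification error rates on the transformed speech. As a minimal illustration (not this paper's actual evaluation pipeline), the sketch below computes the equal error rate (EER) from genuine and impostor verification scores with NumPy; the score arrays are hypothetical outputs of any speaker verification model.

```python
import numpy as np

def equal_error_rate(genuine: np.ndarray, impostor: np.ndarray) -> float:
    """EER: the operating point where the false-accept rate (FAR) equals
    the false-reject rate (FRR). A higher EER on anonymized speech means
    the voice conversion hides speaker identity more effectively."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # impostors accepted
    frr = np.array([(genuine < t).mean() for t in thresholds])    # genuine rejected
    i = np.argmin(np.abs(far - frr))
    return float((far[i] + frr[i]) / 2)

# Hypothetical scores: anonymization should push the two distributions together.
rng = np.random.default_rng(0)
print(equal_error_rate(rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000)))
```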

Author(s):  
Conghui Tan, Di Jiang, Jinhua Peng, Xueyang Wu, Qian Xu, ...

Due to rising awareness of privacy protection and the voluminous scale of speech data, it is becoming infeasible for Automatic Speech Recognition (ASR) system developers to train acoustic models on the complete data as before. In this paper, we propose a novel Divide-and-Merge paradigm to address these problems. In the Divide phase, multiple acoustic models are trained on different subsets of the complete speech data; in the Merge phase, two novel algorithms are used to generate a high-quality acoustic model from those trained on the subsets. We first propose the Genetic Merge Algorithm (GMA), a highly specialized algorithm for optimizing acoustic models that suffers from low efficiency. We then propose the SGD-Based Optimizational Merge Algorithm (SOMA), which effectively alleviates the efficiency bottleneck of GMA while maintaining superior performance. Extensive experiments on public data show that the proposed methods significantly outperform the state-of-the-art.
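The Merge phase can be pictured as a search for interpolation weights over the sub-models' parameters. The sketch below is a generic genetic search over such weights, written from the abstract's description rather than from the paper's actual GMA; `fitness` (e.g., negative dev-set WER of the merged model) and the per-model parameter lists are assumed inputs.

```python
import random
import numpy as np

def weighted_merge(models_params, w):
    # Merged model = weighted average of corresponding parameter arrays.
    return [sum(wi * p for wi, p in zip(w, group)) for group in zip(*models_params)]

def genetic_merge(models_params, fitness, pop_size=20, generations=50, sigma=0.05):
    """Evolve a population of mixing-weight vectors on the probability simplex."""
    k = len(models_params)
    population = [np.random.dirichlet(np.ones(k)) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population,
                        key=lambda w: fitness(weighted_merge(models_params, w)),
                        reverse=True)
        parents = ranked[: pop_size // 2]                         # selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = (a + b) / 2 + np.random.normal(0, sigma, k)   # crossover + mutation
            child = np.clip(child, 1e-6, None)
            children.append(child / child.sum())                  # stay on the simplex
        population = parents + children
    return max(population, key=lambda w: fitness(weighted_merge(models_params, w)))
```

A SOMA-style merge would replace this population search with direct gradient steps on the interpolation weights, which is why it scales better; the fitness calls are the expensive part here.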


Ingeniería, 2017, Vol 22 (3), pp. 362
Author(s):  
Juan David Celis Nuñez, Rodrigo Andres Llanos Castro, Byron Medina Delgado, Sergio Basilio Sepúlveda Mora, Sergio Alexander Castro Casadiego

Context: Automatic speech recognition requires the development of language and acoustic models for the different existing dialects. The purpose of this research is the training of an acoustic model, a statistical language model, and a grammar language model for Spanish, specifically for the dialect of the city of San Jose de Cucuta, Colombia, to be used in a command-and-control system. Existing models for Spanish have problems recognizing the fundamental frequency and spectral content, the accent, pronunciation, and tone of Cucuta's dialect, or simply lack a language model for it.

Method: In this project, we used a Raspberry Pi B+ embedded system running Raspbian (a Linux distribution) and two open-source tools: the CMU-Cambridge Statistical Language Modeling Toolkit from the University of Cambridge and CMU Sphinx from Carnegie Mellon University; both are based on hidden Markov models for the calculation of voice parameters. In addition, we used 1,913 audio recordings of speakers from San Jose de Cucuta and the Norte de Santander department for training and testing the automatic speech recognition system.

Results: We obtained a language model consisting of two files: the statistical language model (.lm) and the JSGF grammar model (.jsgf). For the acoustic component, two models were trained, one of them an improved version that reached a 100 % accuracy rate on the training results and an 83 % accuracy rate in the audio tests for command recognition. Finally, we wrote a manual for the creation of acoustic and language models with the CMU Sphinx software.

Conclusions: The number of participants in the training process of the language and acoustic models has a significant influence on the voice-processing quality of the recognizer. Using a large dictionary for training and a short dictionary containing the command words for deployment is important to get a better response from the automatic speech recognition system. Given an accuracy rate above 80 % in the voice recognition tests, the proposed models are suitable for applications oriented to assisting people with visual or motor impairments.
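For context, a command recognizer built from such models is typically wired together with the classic pocketsphinx Python bindings roughly as below; all file paths are hypothetical placeholders for the trained acoustic model, the .lm statistical model, and the pronunciation dictionary described above.

```python
from pocketsphinx.pocketsphinx import Decoder

# Hypothetical paths to the artifacts produced by the training described above.
config = Decoder.default_config()
config.set_string('-hmm', 'es_cucuta/acoustic_model')  # trained acoustic model
config.set_string('-lm', 'es_cucuta/commands.lm')      # statistical language model
config.set_string('-dict', 'es_cucuta/commands.dic')   # pronunciation dictionary
decoder = Decoder(config)

with open('utterance.raw', 'rb') as f:                 # 16 kHz, 16-bit mono PCM
    decoder.start_utt()
    decoder.process_raw(f.read(), False, True)         # full utterance at once
    decoder.end_utt()

if decoder.hyp() is not None:
    print('Recognized command:', decoder.hyp().hypstr)
```

For a fixed command vocabulary, the .jsgf grammar can be used in place of the .lm, which matches the paper's point about using a short command dictionary at deployment time.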


Accent is one of the main issues for speech recognition systems: automatic speech recognition systems must yield high performance across different dialects. In this work, Neutral Kannada Automatic Speech Recognition is implemented using the Kaldi toolkit with monophone and triphone modelling. The acoustic models are constructed using the monophone, triphone1, triphone2, and triphone3 techniques; in triphone modelling, context-dependent phones are grouped. Feature extraction is performed with Mel Frequency Cepstral Coefficients. System performance is analysed by measuring the Word Error Rate under the different acoustic models. To assess the robustness of the Neutral Kannada Automatic Speech Recognition system across Kannada dialects, the system is also tested on the North Kannada accent. The sentence accuracy obtained for the Neutral Kannada system is about 90 %, but performance degrades to around 77 % when tested on the North Kannada accent. The degradation is due to the increasing mismatch between the training and testing data sets, as the system is trained only on a neutral Kannada acoustic model and does not include a North Kannada acoustic model. An interactive Kannada voice response system is also implemented to recognize continuous Kannada speech sentences.
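Word Error Rate, the metric used above, is the word-level Levenshtein distance between reference and hypothesis divided by the reference length (Kaldi computes it internally via its compute-wer tool). A minimal self-contained implementation might look like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    r, h = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words:
    # d[i][j] = edits to turn the first i reference words into the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                                              # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                                              # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])       # substitution (or match)
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

print(word_error_rate("open the door", "open a door"))  # 1 substitution / 3 words = 0.333
```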


Author(s):  
Danny Henry Galatang, Suyanto Suyanto

Syllable-based automatic speech recognition (ASR) systems commonly perform better than phoneme-based ones. This paper focuses on developing an Indonesian monosyllable-based ASR (MSASR) system using an ASR engine called SPRAAK and comparing it to a phoneme-based one. The Mozilla DeepSpeech-based end-to-end ASR (MDS-E2EASR), one of the state-of-the-art models based on characters (similar to the phoneme-based model), is also investigated to confirm the result. In addition, a novel Kaituoxu SpeechTransformer (KST) E2EASR is examined. Testing on an Indonesian speech corpus of 5,439 words shows that the proposed MSASR produces much higher word accuracy (76.57%) than the monophone-based one (63.36%). Its performance is comparable to the character-based MDS-E2EASR, which produces 76.90%, and the character-based KST-E2EASR (78.00%). In the future, this monosyllable-based ASR could be extended to a bisyllable-based one for higher word accuracy; however, the resulting large set of bisyllable acoustic models would have to be handled with an advanced method.
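A syllable-based lexicon presupposes a syllabifier. The following is a deliberately naive rule-based splitter for Indonesian, written purely for illustration and not the method used in the paper; it uses a greedy CV(C) heuristic and mishandles digraphs such as "ng".

```python
import re

# Greedy CV(C) syllabifier: optional onset consonants, a vowel nucleus,
# and a coda consonant only when no vowel follows it (naive heuristic).
SYLLABLE = re.compile(r"[^aeiou]*[aeiou]+(?:[^aeiou](?![aeiou]))?")

def syllabify(word: str) -> list[str]:
    return SYLLABLE.findall(word.lower())

print(syllabify("selamat"))  # ['se', 'la', 'mat']
print(syllabify("makan"))    # ['ma', 'kan']
```

A production system would instead derive syllable units from a pronunciation lexicon, but even this toy version shows why the monosyllable inventory stays small while a bisyllable inventory (pairs of such units) grows combinatorially.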


Author(s):  
Alexandru-Lucian Georgescu, Alessandro Pappalardo, Horia Cucu, Michaela Blott

Abstract: The last decade brought significant advances in automatic speech recognition (ASR) thanks to the evolution of deep learning methods. ASR systems evolved from pipeline-based systems, which modeled hand-crafted speech features with probabilistic frameworks and generated phone posteriors, to end-to-end (E2E) systems, which translate the raw waveform directly into words using a single deep neural network (DNN). Transcription accuracy greatly increased, leading to ASR technology being integrated into many commercial applications. However, few existing ASR technologies are suitable for integration into embedded applications, due to hard constraints on computing power and memory usage. This overview paper serves as a guided tour through the recent literature on speech recognition and compares the most popular ASR implementations. The comparison emphasizes the trade-off between ASR performance and hardware requirements, to help decision makers choose the system that best fits their embedded application. To the best of our knowledge, this is the first study to provide this kind of trade-off analysis for state-of-the-art ASR systems.
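One concrete way to frame the hardware side of that trade-off is the raw weight memory of a model at different numeric precisions. The sketch below uses made-up parameter counts, purely to illustrate the kind of comparison such a survey performs.

```python
def param_memory_mb(num_params: int, bytes_per_param: int) -> float:
    """Raw weight storage in MiB, ignoring activations and runtime buffers."""
    return num_params * bytes_per_param / 2**20

# Hypothetical model sizes, for illustration only.
for name, params in [("small hybrid AM", 15_000_000), ("E2E transformer", 120_000_000)]:
    for precision, nbytes in [("float32", 4), ("int8", 1)]:
        print(f"{name:16s} {precision}: {param_memory_mb(params, nbytes):8.1f} MiB")
```

Even this crude estimate shows why quantization and smaller hybrid models dominate on embedded targets: a 120M-parameter E2E model needs roughly 458 MiB at float32, versus about 14 MiB for a 15M-parameter model at int8.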


2021, Vol 8 (1)
Author(s):  
Asmaa El Hannani, Rahhal Errattahi, Fatima Zahra Salmam, Thomas Hain, Hassan Ouahmane

Abstract: Speech-based human-machine interaction and natural language understanding applications have seen rapid development and wide adoption over the last few decades. This has led to a proliferation of studies that investigate error detection and classification in Automatic Speech Recognition (ASR) systems. However, different data sets and evaluation protocols are used, making direct comparison of the proposed approaches (e.g. features and models) difficult. In this paper we perform an extensive evaluation of the effectiveness and efficiency of state-of-the-art approaches in a unified framework for both error detection and error type classification. We make three primary contributions throughout this paper: (1) we compare our Variant Recurrent Neural Network (V-RNN) model with three other state-of-the-art neural models, and show that the V-RNN model is the most effective classifier for ASR error detection in terms of accuracy and speed; (2) we compare four feature settings, corresponding to different categories of predictor features, and show that the generic features are particularly suitable for real-time ASR error detection applications; and (3) we examine the generalization ability of our error detection framework and perform a detailed post-detection analysis to understand which recognition errors are difficult to detect.
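As a sketch of the kind of neural error detector compared in such studies (a generic recurrent tagger, not the paper's exact V-RNN), the PyTorch module below labels each hypothesis word as correct or erroneous from per-word predictor features; the feature dimension and its contents are assumptions.

```python
import torch
import torch.nn as nn

class ErrorDetector(nn.Module):
    """BiLSTM tagger: per-word predictor features -> correct/error logits.
    feat_dim would hold per-word confidence, duration, LM scores, etc."""
    def __init__(self, feat_dim: int, hidden: int = 128, n_classes: int = 2):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_words, feat_dim) -> logits: (batch, n_words, n_classes)
        h, _ = self.rnn(feats)
        return self.out(h)

model = ErrorDetector(feat_dim=10)
logits = model(torch.randn(4, 20, 10))  # 4 hypotheses, 20 words each
errors = logits.argmax(dim=-1)          # 1 = predicted recognition error
```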


Author(s):  
Sonal Anilkumar Tiwari

Abstract: It is intriguing to think that we can issue commands to inanimate objects, and this is possible with the help of ASR systems. A speech recognition system enables humans to talk with machines. Nowadays, speech recognition is a technique without which a person can hardly carry out daily work, and people have grown dependent on it. It has become a habit, much like using mobile phones: when we want to type something, we can instead issue voice commands immediately, which reduces our effort and saves a great deal of time. Keywords: Speech, Speech Recognition, ASR, Corpus, PRAAT

