scholarly journals Study of algorithms to combine multiple automatic speech recognition (ASR) system outputs

2009 ◽  
Author(s):  
Harish Kashyap Krishnamurthy
2017 ◽  
Vol 60 (9) ◽  
pp. 2394-2405 ◽  
Author(s):  
Lionel Fontan ◽  
Isabelle Ferrané ◽  
Jérôme Farinas ◽  
Julien Pinquier ◽  
Julien Tardieu ◽  
...  

Purpose The purpose of this article is to assess speech processing for listeners with simulated age-related hearing loss (ARHL) and to investigate whether the observed performance can be replicated using an automatic speech recognition (ASR) system. The long-term goal of this research is to develop a system that will assist audiologists/hearing-aid dispensers in the fine-tuning of hearing aids. Method Sixty young participants with normal hearing listened to speech materials mimicking the perceptual consequences of ARHL at different levels of severity. Two intelligibility tests (repetition of words and sentences) and 1 comprehension test (responding to oral commands by moving virtual objects) were administered. Several language models were developed and used by the ASR system in order to fit human performances. Results Strong significant positive correlations were observed between human and ASR scores, with coefficients up to .99. However, the spectral smearing used to simulate losses in frequency selectivity caused larger declines in ASR performance than in human performance. Conclusion Both intelligibility and comprehension scores for listeners with simulated ARHL are highly correlated with the performances of an ASR-based system. In the future, it needs to be determined if the ASR system is similarly successful in predicting speech processing in noise and by older people with ARHL.


2020 ◽  
Vol 8 (5) ◽  
pp. 1677-1681

Stuttering or Stammering is a speech defect within which sounds, syllables, or words are rehashed or delayed, disrupting the traditional flow of speech. Stuttering can make it hard to speak with other individuals, which regularly have an effect on an individual's quality of life. Automatic Speech Recognition (ASR) system is a technology that converts audio speech signal into corresponding text. Presently ASR systems play a major role in controlling or providing inputs to the various applications. Such an ASR system and Machine Translation Application suffers a lot due to stuttering (speech dysfluency). Dysfluencies will affect the phrase consciousness accuracy of an ASR, with the aid of increasing word addition, substitution and dismissal rates. In this work we focused on detecting and removing the prolongation, silent pauses and repetition to generate proper text sequence for the given stuttered speech signal. The stuttered speech recognition consists of two stages namely classification using LSTM and testing in ASR. The major phases of classification system are Re-sampling, Segmentation, Pre-Emphasis, Epoch Extraction and Classification. The current work is carried out in UCLASS Stuttering dataset using MATLAB with 4% to 6% increase in accuracy when compare with ANN and SVM.


Author(s):  
Mohit Dua ◽  
Pawandeep Singh Sethi ◽  
Vinam Agrawal ◽  
Raghav Chawla

Introduction: An Automatic Speech Recognition (ASR) system enables to recognize the speech utterances and thus can be used to convert speech into text for various purposes. These systems are deployed in different environments such as clean or noisy and are used by all ages or types of people. These also present some of the major difficulties faced in the development of an ASR system. Thus, an ASR system need to be efficient, while also being accurate and robust. Our main goal is to minimize the error rate during training as well as testing phases, while implementing an ASR system. Performance of ASR depends upon different combinations of feature extraction techniques and back-end techniques. In this paper, using a continuous speech recognition system, the performance comparison of different combinations of feature extraction techniques and various types of back-end techniques has been presented Methods: Hidden Markov Models (HMMs), Subspace Gaussian Mixture Models (SGMMs) and Deep Neural Networks (DNNs) with DNN-HMM architecture, namely Karel's, Dan's and Hybrid DNN-SGMM architecture are used at the back-end of the implemented system. Mel frequency Cepstral Coefficient (MFCC), Perceptual Linear Prediction (PLP), and Gammatone Frequency Cepstral coefficients (GFCC) are used as feature extraction techniques at the front-end of the proposed system. Kaldi toolkit has been used for the implementation of the proposed work. The system is trained on the Texas Instruments-Massachusetts Institute of Technology (TIMIT) speech corpus for English language Results: The experimental results show that MFCC outperforms GFCC and PLP in noiseless conditions, while PLP tends to outperform MFCC and GFCC in noisy conditions. Furthermore, the hybrid of Dan's DNN implementation along with SGMM performs the best for the back-end acoustic modeling. The proposed architecture with PLP feature extraction technique in the front end and hybrid of Dan's DNN implementation along with SGMM at the back end outperforms the other combinations in a noisy environment. Conclusion: Automatic Speech recognition has numerous applications in our lives like Home automation, Personal assistant, Robotics etc. It is highly desirable to build an ASR system with good performance. The performance Automatic Speech Recognition is affected by various factors which include vocabulary size, whether system is speaker dependent or independent, whether speech is isolated, discontinuous or continuous, adverse conditions like noise. The paper presented an ensemble architecture that uses PLP for feature extraction at the front end and a hybrid of SGMM + Dan's DNN in the backend to build a noise robust ASR system Discussion: The presented work in this paper discusses the performance comparison of continuous ASR systems developed using different combinations of front-end feature extraction (MFCC, PLP, and GFCC) and back-end acoustic modeling (mono-phone, tri-phone, SGMM, DNN and hybrid DNN-SGMM) techniques. Each type of front-end technique is tested in combination with each type of back-end technique. Finally, it compares the results of the combinations thus formed, to find out the best performing combination in noisy and clean conditions


2018 ◽  
Vol 29 (1) ◽  
pp. 959-976
Author(s):  
Mohit Dua ◽  
Rajesh Kumar Aggarwal ◽  
Mantosh Biswas

Abstract An automatic speech recognition (ASR) system translates spoken words or utterances (isolated, connected, continuous, and spontaneous) into text format. State-of-the-art ASR systems mainly use Mel frequency (MF) cepstral coefficient (MFCC), perceptual linear prediction (PLP), and Gammatone frequency (GF) cepstral coefficient (GFCC) for extracting features in the training phase of the ASR system. Initially, the paper proposes a sequential combination of all three feature extraction methods, taking two at a time. Six combinations, MF-PLP, PLP-MFCC, MF-GFCC, GF-MFCC, GF-PLP, and PLP-GFCC, are used, and the accuracy of the proposed system using all these combinations was tested. The results show that the GF-MFCC and MF-GFCC integrations outperform all other proposed integrations. Further, these two feature vector integrations are optimized using three different optimization methods, particle swarm optimization (PSO), PSO with crossover, and PSO with quadratic crossover (Q-PSO). The results demonstrate that the Q-PSO-optimized GF-MFCC integration show significant improvement over all other optimized combinations.


Author(s):  
Deepang Raval ◽  
Vyom Pathak ◽  
Muktan Patel ◽  
Brijesh Bhatt

We present a novel approach for improving the performance of an End-to-End speech recognition system for the Gujarati language. We follow a deep learning-based approach that includes Convolutional Neural Network, Bi-directional Long Short Term Memory layers, Dense layers, and Connectionist Temporal Classification as a loss function. To improve the performance of the system with the limited size of the dataset, we present a combined language model (Word-level language Model and Character-level language model)-based prefix decoding technique and Bidirectional Encoder Representations from Transformers-based post-processing technique. To gain key insights from our Automatic Speech Recognition (ASR) system, we used the inferences from the system and proposed different analysis methods. These insights help us in understanding and improving the ASR system as well as provide intuition into the language used for the ASR system. We have trained the model on the Microsoft Speech Corpus, and we observe a 5.87% decrease in Word Error Rate (WER) with respect to base-model WER.


2014 ◽  
Vol 679 ◽  
pp. 189-193 ◽  
Author(s):  
Rosemary T. Salaja ◽  
Ronan Flynn ◽  
Michael Russell

Research in speech recognition has produced different approaches that have been used for the classification of speech utterances in the back-end of an automatic speech recognition (ASR) system. As speech recognition is a pattern recognition problem, classification is an important part of any speech recognition system. This paper proposes a new back-end classifier that is based on artificial life (ALife) and describes how the proposed classifier can be used in a speech recognition system.


2013 ◽  
Vol 10 (1) ◽  
pp. 219-230 ◽  
Author(s):  
Jovica Tasevski ◽  
Milutin Nikolic ◽  
Dragisa Miskovic

The paper reports a solution for the integration of the industrial robot ABB IRB140 with the system for automatic speech recognition (ASR) and the system for computer vision. The robot has the task to manipulate the objects placed randomly on a pad lying on a table, and the computer vision system has to recognize their characteristics (shape, dimension, color, position, and orientation). The ASR system has a task to recognize human speech and use it as a command to the robot, so the robot can manipulate the objects.


The present manuscript focuses on building automatic speech recognition (ASR) system for Marathi language (M-ASR) using Hidden Markov Model Toolkit (HTK). The M-ASR system gives the detail about experimentation and implementation using the HTK Toolkit. In this work total 106 speaker independent Marathi isolated words were recognized. These unique Marathi words are used to train and evaluate M-ASR system. The speech corpus (database) is created by us using isolated Marathi words uttered with mixed gender people. The system uses Mel Frequency Cepstral Coefficient (MFCC) for the purpose of extracting features using Gaussian mixture model (GMM). Viterbi algorithm based on token passing is used for decoding to recognize unknown utterances. The proposed M-ASR system is speaker independent. The proposed system has reported 96.23% word level recognition accuracy.


Computers ◽  
2019 ◽  
Vol 8 (4) ◽  
pp. 76 ◽  
Author(s):  
Laurynas Pipiras ◽  
Rytis Maskeliūnas ◽  
Robertas Damaševičius

Automatic speech recognition (ASR) has been one of the biggest and hardest challenges in the field. A large majority of research in this area focuses on widely spoken languages such as English. The problems of automatic Lithuanian speech recognition have attracted little attention so far. Due to complicated language structure and scarcity of data, models proposed for other languages such as English cannot be directly adopted for Lithuanian. In this paper we propose an ASR system for the Lithuanian language, which is based on deep learning methods and can identify spoken words purely from their phoneme sequences. Two encoder-decoder models are used to solve the ASR task: a traditional encoder-decoder model and a model with attention mechanism. The performance of these models is evaluated in isolated speech recognition task (with an accuracy of 0.993) and long phrase recognition task (with an accuracy of 0.992).


Sign in / Sign up

Export Citation Format

Share Document