Lithuanian Speech Recognition Using Purely Phonetic Deep Learning

Computers ◽  
2019 ◽  
Vol 8 (4) ◽  
pp. 76 ◽  
Author(s):  
Laurynas Pipiras ◽  
Rytis Maskeliūnas ◽  
Robertas Damaševičius

Automatic speech recognition (ASR) has been one of the biggest and hardest challenges in the field. A large majority of research in this area focuses on widely spoken languages such as English, while the problems of automatic Lithuanian speech recognition have attracted little attention so far. Because of the complicated language structure and the scarcity of data, models proposed for other languages such as English cannot be directly adopted for Lithuanian. In this paper we propose an ASR system for the Lithuanian language that is based on deep learning methods and can identify spoken words purely from their phoneme sequences. Two encoder-decoder models are used to solve the ASR task: a traditional encoder-decoder model and a model with an attention mechanism. The performance of these models is evaluated on an isolated speech recognition task (accuracy of 0.993) and a long phrase recognition task (accuracy of 0.992).
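To illustrate what an attention mechanism adds to an encoder-decoder model, the following is a minimal sketch of dot-product attention in plain Python. The abstract does not say which attention variant the authors used, so the dot-product scoring, the function names, and the toy vectors are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(decoder_state, encoder_states):
    """Hypothetical helper: weight each encoder state by its dot-product
    similarity to the current decoder state and return the weighted sum."""
    scores = [
        sum(d * e for d, e in zip(decoder_state, enc))
        for enc in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * enc[i] for w, enc in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy values (invented): three encoder time steps, two-dimensional states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [1.0, 0.0]
weights, context = attention_context(decoder_state, encoder_states)
print(weights, context)
```

A plain encoder-decoder model would pass only a single fixed summary of the input to the decoder; with attention, the decoder recomputes these weights at every output step.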

Author(s):  
Deepang Raval ◽  
Vyom Pathak ◽  
Muktan Patel ◽  
Brijesh Bhatt

We present a novel approach for improving the performance of an End-to-End speech recognition system for the Gujarati language. We follow a deep learning-based approach that includes Convolutional Neural Network, Bi-directional Long Short Term Memory layers, Dense layers, and Connectionist Temporal Classification as a loss function. To improve the performance of the system with the limited size of the dataset, we present a combined language model (Word-level language Model and Character-level language model)-based prefix decoding technique and Bidirectional Encoder Representations from Transformers-based post-processing technique. To gain key insights from our Automatic Speech Recognition (ASR) system, we used the inferences from the system and proposed different analysis methods. These insights help us in understanding and improving the ASR system as well as provide intuition into the language used for the ASR system. We have trained the model on the Microsoft Speech Corpus, and we observe a 5.87% decrease in Word Error Rate (WER) with respect to base-model WER.
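For readers unfamiliar with the metric, Word Error Rate is conventionally the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a generic implementation under that definition; it is not the authors' evaluation code, and the function name and example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one reference word is dropped by the recognizer.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.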


2021 ◽  
Author(s):  
Matheus Xavier Sampaio ◽  
Regis Pires Magalhães ◽  
Ticiana Linhares Coelho da Silva ◽  
Lívia Almada Cruz ◽  
Davi Romero de Vasconcelos ◽  
...  

Automatic Speech Recognition (ASR) is an essential task for many applications like automatic caption generation for videos, voice search, voice commands for smart homes, and chatbots. Due to the increasing popularity of these applications and the advances in deep learning models for transcribing speech into text, this work aims to evaluate the performance of commercial solutions for ASR that use deep learning models, such as Facebook Wit.ai, Microsoft Azure Speech, and Google Cloud Speech-to-Text. The results demonstrate that the evaluated solutions slightly differ. However, Microsoft Azure Speech outperformed the other analyzed APIs.


2021 ◽  
Vol 11 (19) ◽  
pp. 8872
Author(s):  
Iván G. Torre ◽  
Mónica Romero ◽  
Aitor Álvarez

Automatic speech recognition in patients with aphasia is a challenging task for which studies have been published in only a few languages. Understandably, the systems reported in the literature within this field show significantly lower performance than those focused on transcribing non-pathological clean speech. This is mainly due to the difficulty of recognizing less intelligible speech and to the scarcity of annotated aphasic data. This work focuses on applying novel semi-supervised learning methods to the AphasiaBank dataset in order to deal with these two major issues, reporting improvements for the English language and providing the first benchmark for the Spanish language, for which less than one hour of transcribed aphasic speech was used for training. In addition, the influence of reinforcing the training and decoding processes with out-of-domain acoustic and text data is described, using different strategies and configurations to fine-tune the hyperparameters and the final recognition systems. The results obtained encourage extending this technological approach to other languages and scenarios where the scarcity of annotated data to train recognition models is a challenging reality.


2017 ◽  
Vol 60 (9) ◽  
pp. 2394-2405 ◽  
Author(s):  
Lionel Fontan ◽  
Isabelle Ferrané ◽  
Jérôme Farinas ◽  
Julien Pinquier ◽  
Julien Tardieu ◽  
...  

Purpose: The purpose of this article is to assess speech processing for listeners with simulated age-related hearing loss (ARHL) and to investigate whether the observed performance can be replicated using an automatic speech recognition (ASR) system. The long-term goal of this research is to develop a system that will assist audiologists/hearing-aid dispensers in the fine-tuning of hearing aids. Method: Sixty young participants with normal hearing listened to speech materials mimicking the perceptual consequences of ARHL at different levels of severity. Two intelligibility tests (repetition of words and sentences) and 1 comprehension test (responding to oral commands by moving virtual objects) were administered. Several language models were developed and used by the ASR system in order to fit human performances. Results: Strong significant positive correlations were observed between human and ASR scores, with coefficients up to .99. However, the spectral smearing used to simulate losses in frequency selectivity caused larger declines in ASR performance than in human performance. Conclusion: Both intelligibility and comprehension scores for listeners with simulated ARHL are highly correlated with the performances of an ASR-based system. In the future, it needs to be determined if the ASR system is similarly successful in predicting speech processing in noise and by older people with ARHL.
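The correlation coefficients mentioned in the Results can be pictured with a small sketch. The abstract does not state which correlation statistic was used, so Pearson's r is assumed here, and the score lists are invented placeholders rather than data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented scores: one value per simulated hearing-loss severity level.
human_scores = [0.95, 0.88, 0.70, 0.45, 0.20]
asr_scores = [0.90, 0.80, 0.55, 0.30, 0.10]
print(pearson_r(human_scores, asr_scores))
```

A coefficient close to 1 means the two sets of scores fall together as severity increases; it does not mean they are equal in absolute terms, which is consistent with the reported larger decline in ASR performance under spectral smearing.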


Author(s):  
Ramy Mounir ◽  
Redwan Alqasemi ◽  
Rajiv Dubey

This work focuses on research aimed at enabling individuals with speech impairments to use speech-to-text software to recognize and dictate their speech. Automatic Speech Recognition (ASR) is a challenging problem for researchers because of the wide range of speech variability, including differences in accent, pronunciation, speed, and volume. It is very difficult to train an end-to-end speech recognition model on data from speakers with speech impediments because sufficiently large datasets are lacking and because a speech disorder pattern is hard to generalize across all affected users. This work highlights the different techniques used in deep learning to achieve ASR and how they can be modified to recognize and dictate speech from individuals with speech impediments.


2020 ◽  
Vol 8 (5) ◽  
pp. 1677-1681

Stuttering, or stammering, is a speech disorder in which sounds, syllables, or words are repeated or prolonged, disrupting the normal flow of speech. Stuttering can make it hard to communicate with other people, which often affects a person's quality of life. An Automatic Speech Recognition (ASR) system converts an audio speech signal into the corresponding text. ASR systems now play a major role in controlling or providing input to various applications. ASR systems and machine translation applications suffer considerably from stuttering (speech dysfluency): dysfluencies reduce the word recognition accuracy of an ASR system by increasing word insertion, substitution, and deletion rates. In this work we focus on detecting and removing prolongations, silent pauses, and repetitions to generate a proper text sequence for a given stuttered speech signal. Stuttered speech recognition consists of two stages: classification using LSTM and testing in ASR. The major phases of the classification system are re-sampling, segmentation, pre-emphasis, epoch extraction, and classification. The work was carried out on the UCLASS stuttering dataset using MATLAB, with a 4% to 6% increase in accuracy compared with ANN and SVM.
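Of the preprocessing phases listed, pre-emphasis is the simplest to show concretely. It is commonly implemented as a first-order high-pass filter, y[n] = x[n] - a * x[n-1], that boosts high frequencies before feature extraction. The sketch below is in Python rather than the MATLAB used in the paper, and the coefficient 0.97 is a widely used default assumed here, not a value taken from the abstract.

```python
def pre_emphasis(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1]; the first sample is passed through.

    alpha=0.97 is an assumed, commonly used default.
    """
    if not signal:
        return []
    emphasized = [signal[0]]
    for n in range(1, len(signal)):
        emphasized.append(signal[n] - alpha * signal[n - 1])
    return emphasized


# A constant (zero-frequency) input is strongly attenuated after the first sample.
print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
```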


Author(s):  
Mohit Dua ◽  
Pawandeep Singh Sethi ◽  
Vinam Agrawal ◽  
Raghav Chawla

Introduction: An Automatic Speech Recognition (ASR) system recognizes speech utterances and can thus be used to convert speech into text for various purposes. These systems are deployed in different environments, clean or noisy, and are used by people of all ages and types. These conditions also present some of the major difficulties faced in the development of an ASR system. Thus, an ASR system needs to be efficient while also being accurate and robust. Our main goal is to minimize the error rate during both the training and testing phases when implementing an ASR system. The performance of ASR depends on the combination of feature extraction techniques and back-end techniques. In this paper, using a continuous speech recognition system, a performance comparison of different combinations of feature extraction techniques and various types of back-end techniques is presented. Methods: Hidden Markov Models (HMMs), Subspace Gaussian Mixture Models (SGMMs), and Deep Neural Networks (DNNs) with DNN-HMM architecture, namely Karel's, Dan's, and hybrid DNN-SGMM architectures, are used at the back end of the implemented system. Mel Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP), and Gammatone Frequency Cepstral Coefficients (GFCC) are used as feature extraction techniques at the front end of the proposed system. The Kaldi toolkit has been used for the implementation of the proposed work. The system is trained on the Texas Instruments-Massachusetts Institute of Technology (TIMIT) speech corpus for the English language. Results: The experimental results show that MFCC outperforms GFCC and PLP in noiseless conditions, while PLP tends to outperform MFCC and GFCC in noisy conditions. Furthermore, the hybrid of Dan's DNN implementation with SGMM performs best for back-end acoustic modeling. The proposed architecture, with PLP feature extraction at the front end and the hybrid of Dan's DNN implementation with SGMM at the back end, outperforms the other combinations in a noisy environment. Conclusion: Automatic speech recognition has numerous applications in our lives, such as home automation, personal assistants, and robotics, so it is highly desirable to build an ASR system with good performance. The performance of automatic speech recognition is affected by various factors, including vocabulary size, whether the system is speaker dependent or independent, whether speech is isolated, discontinuous, or continuous, and adverse conditions such as noise. The paper presents an ensemble architecture that uses PLP for feature extraction at the front end and a hybrid of SGMM and Dan's DNN at the back end to build a noise-robust ASR system. Discussion: The work presented in this paper compares the performance of continuous ASR systems developed using different combinations of front-end feature extraction (MFCC, PLP, and GFCC) and back-end acoustic modeling (mono-phone, tri-phone, SGMM, DNN, and hybrid DNN-SGMM) techniques. Each type of front-end technique is tested in combination with each type of back-end technique. Finally, the results of the combinations thus formed are compared to find the best-performing combination in noisy and clean conditions.

