Hierarchical Phoneme Classification for Improved Speech Recognition

2021 ◽  
Vol 11 (1) ◽  
pp. 428
Author(s):  
Donghoon Oh ◽  
Jeong-Sik Park ◽  
Ji-Hwan Kim ◽  
Gil-Jin Jang

Speech recognition consists of converting input sound into a sequence of phonemes and then finding text for the input using language models. Phoneme classification performance is therefore a critical factor in the successful implementation of a speech recognition system. However, correctly distinguishing phonemes with similar characteristics remains a challenging problem even for state-of-the-art classification methods, and classification errors are hard to recover from in the subsequent language processing steps. This paper proposes a hierarchical phoneme clustering method that applies more suitable recognition models to different phonemes. The phonemes of the TIMIT database are carefully analyzed using a confusion matrix from a baseline speech recognition model. Based on the automatic phoneme clustering results, a set of phoneme classification models optimized for the generated phoneme groups is constructed and integrated into a hierarchical phoneme classification method. In a number of phoneme classification experiments, the proposed hierarchical phoneme group models improved performance over the baseline by 3%, 2.1%, 6.0%, and 2.2% for fricative, affricate, stop, and nasal sounds, respectively. The average accuracy was 69.5% for the baseline and 71.7% for the proposed hierarchical models, a 2.2% overall improvement.
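The grouping step described above can be illustrated with a small sketch. This is a hypothetical simplification, not the authors' exact algorithm: phonemes whose mutual confusion rate (read off a confusion matrix) exceeds a threshold are merged into one group, so that a specialized classifier could then be trained per group.

```python
# Hypothetical sketch of confusion-driven phoneme grouping: merge phonemes
# that a baseline recognizer frequently confuses with each other.

def confusion_groups(phonemes, confusion, threshold):
    """Group phonemes whose symmetric confusion rate exceeds `threshold`.

    confusion[i][j] is the fraction of times phoneme i was recognized as j.
    Uses a simple union-find over pairs with high average confusion.
    """
    parent = list(range(len(phonemes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    n = len(phonemes)
    for i in range(n):
        for j in range(i + 1, n):
            # symmetric confusion: average of both directions
            if (confusion[i][j] + confusion[j][i]) / 2 >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, p in enumerate(phonemes):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())
```

With a toy 4-phoneme confusion matrix, frequently confused fricatives such as /s/ and /z/ end up in one group while distinct phonemes stay separate.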

Author(s):  
Mridusmita Sharma ◽  
Kandarpa Kumar Sarma

Speech is the natural means of communication, but it is not the typical input means afforded by computers. Interaction between humans and machines would be easier if speech were an effective alternative to the keyboard and mouse. With advances in signal processing and model-building techniques and the growing power of computing devices, significant progress has been made in speech recognition research, and various speech-based applications have been developed. With the rapid advancement of speech recognition technology, telephone speech technology is becoming involved in many new applications of spoken language processing. The literature shows that spectro-temporal features give a significant performance improvement for telephone speech recognition systems in comparison with the robust feature techniques otherwise used for recognition. In this chapter, the authors report on the various spectral and temporal features and the soft computing techniques that have been used for telephone speech recognition.


Author(s):  
Vanajakshi Puttaswamy Gowda ◽  
Mathivanan Murugavelu ◽  
Senthil Kumaran Thangamuthu

<p><span>Continuous speech segmentation and recognition play an important role in natural language processing. Continuous context-based Kannada speech segmentation depends on the context, grammar, and semantic rules present in the Kannada language. Extracting significant features of the Kannada speech signal for a recognition system is quite exciting for researchers. The method proposed in this paper is divided into two parts. In the first part, continuous Kannada speech is segmented with respect to context by computing the average short-term energy and the spectral centroid coefficients of the speech signal within a specified window. The segmented outputs are fully meaningful segmentations for different scenarios, with low segmentation error. The second part performs speech recognition by extracting a small number of Mel-frequency cepstral coefficients and using vector quantization with a small number of codebooks. Recognition is based entirely on a threshold value. Setting this threshold is a challenging task, but a simple method is used to achieve a good recognition rate. The experimental results show more efficient and effective segmentation, with a higher recognition rate than existing methods, for continuous context-based Kannada speech with different male and female accents, while using minimal feature dimensions for the training data.</span></p>
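The two frame-level features driving the segmentation step, average short-term energy and spectral centroid, can be sketched as follows. This is a minimal illustration under assumed frame and hop sizes, not the authors' implementation:

```python
# Frame-level feature sketch: average short-term energy and spectral centroid.
import numpy as np

def frame_features(signal, sr, frame_len=400, hop=200):
    """Return (energy, centroid) arrays, one value per analysis frame."""
    energies, centroids = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(np.mean(frame ** 2))             # average short-term energy
        spectrum = np.abs(np.fft.rfft(frame))            # magnitude spectrum
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)   # bin center frequencies
        # spectral centroid: magnitude-weighted mean frequency
        centroids.append(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12))
    return np.array(energies), np.array(centroids)
```

On a signal that starts with silence and ends with a 1 kHz tone, the energy rises at the tone onset and the centroid sits near 1000 Hz, which is the kind of contrast a segmenter can threshold on.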


Author(s):  
MIRJAM SEPESY MAUČEC ◽  
TOMAŽ ROTOVNIK ◽  
ZDRAVKO KAČIČ ◽  
JANEZ BREST

This paper presents the results of a study on modeling the highly inflective Slovenian language. We focus on creating a language model for a large-vocabulary speech recognition system. A new data-driven method is proposed for the induction of inflectional morphology into language modeling. The research focus is on data sparsity, which results from the complex morphology of the language. The idea of using subword units is examined. An attempt is made to determine the segmentation of words into two subword units: stems and endings. No prior knowledge of the language is used. The subword units should fit into the frameworks of probabilistic language models. A morphologically correct decomposition of words is not sought; instead, we search for a decomposition which yields the minimum entropy of the training corpus. This entropy is approximated using N-gram models. Despite some seemingly over-simplified assumptions, the subword models improve the applicability of the language models for a sparse training corpus. The experiments were performed using the VEČER newswire text corpus for training. The test set was taken from the SNABI speech database, because the final models were evaluated in speech recognition experiments on that database. Two different subword-based models are proposed and examined experimentally. The experiments demonstrate that subword-based models which considerably reduce the OOV rate improve speech recognition WER compared with standard word-based models, even though they increase test set perplexity. Subword-based models with improved perplexity, but which reduce the OOV rate much less than the previous ones, do not improve speech recognition results.
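The minimum-entropy criterion above can be made concrete with a toy sketch. This is an assumed simplification (unigram entropy rather than the paper's N-gram approximation, and fixed candidate cut points): splitting inflected forms into a shared stem plus short endings lowers the entropy of the unit stream.

```python
# Toy illustration of the entropy criterion for stem+ending decomposition.
import math
from collections import Counter

def unigram_entropy(units):
    """Unigram (maximum-likelihood) entropy of a sequence of units, in bits."""
    counts = Counter(units)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def decompose(corpus, splits):
    """splits maps a word to a cut position; unsplit words pass through whole."""
    units = []
    for word in corpus:
        if word in splits:
            k = splits[word]
            units.extend([word[:k], word[k:]])
        else:
            units.append(word)
    return units
```

For three hypothetical inflected forms sharing the stem "dela", cutting off the endings yields a lower-entropy unit stream than keeping whole words, which is the direction the search criterion rewards.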


2020 ◽  
Author(s):  
Dezhou Shen

Abstract Image classification and categorization are essential to a machine's ability to tell images apart. As Bidirectional Encoder Representations from Transformers (BERT) has become popular in many natural language processing tasks in recent years, it is intuitive to use such pre-trained language models to enhance computer vision tasks, e.g. image classification. In this paper, by encoding image pixels with pre-trained transformers and connecting the result to a fully connected layer, the classification model outperforms the Wide ResNet model and the linear-probe iGPT-L model, achieving accuracy of 99.60%~99.74% on the CIFAR-10 image set and 99.10%~99.76% on the CIFAR-100 image set.
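The data flow of the described pipeline, pixel encoding followed by a fully connected classification head, can be sketched schematically. This is only an assumed stand-in: a frozen random embedding table with mean pooling replaces the actual pre-trained transformer so the shapes are visible; it is not the paper's model.

```python
# Schematic pipeline: "encode" pixel tokens, pool, classify with an FC layer.
import numpy as np

rng = np.random.default_rng(0)

def classify(images, embed, weights, bias):
    """images: (batch, n_pixels) integer pixel values in [0, 255]."""
    tokens = embed[images]                  # (batch, n_pixels, d) stand-in encoding
    pooled = tokens.mean(axis=1)            # (batch, d) sequence representation
    logits = pooled @ weights + bias        # fully connected classification head
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)  # softmax class probabilities
```

In the real system the embedding and pooling would come from the pre-trained transformer, and only the head (or the whole stack) would be trained on CIFAR labels.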


Author(s):  
Lori Lamel ◽  
Jean-Luc Gauvain

Speech recognition is concerned with converting the speech waveform, an acoustic signal, into a sequence of words. Today's approaches are based on statistical modelling of the speech signal. This article provides an overview of the main topics addressed in speech recognition: acoustic-phonetic modelling, lexical representation, language modelling, decoding, and model adaptation. Language models are used in speech recognition to estimate the probability of word sequences. The main components of a generic speech recognition system are the main knowledge sources, feature analysis, the acoustic and language models, which are estimated in a training phase, and the decoder. The focus of this article is on methods used in state-of-the-art speaker-independent, large-vocabulary continuous speech recognition (LVCSR). Primary application areas for such technology are dictation, spoken language dialogue, and transcription for information archival and retrieval systems. Finally, this article discusses issues and directions of future research.
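The language-model role described above, estimating the probability of word sequences, can be shown with a toy bigram model. This example is not from the article; it uses plain maximum-likelihood counts with no smoothing:

```python
# Toy bigram language model: P(w1..wn) ~ product of P(w_i | w_{i-1}).
from collections import Counter

def bigram_probability(sentence, corpus):
    """Maximum-likelihood bigram probability of `sentence` given `corpus`."""
    bigrams, unigrams = Counter(), Counter()
    for sent in corpus:
        words = ["<s>"] + sent.split()      # <s> marks the sentence start
        unigrams.update(words)
        for a, b in zip(words, words[1:]):
            bigrams[(a, b)] += 1
    prob = 1.0
    words = ["<s>"] + sentence.split()
    for a, b in zip(words, words[1:]):
        if bigrams[(a, b)] == 0:
            return 0.0                      # unseen bigram: zero without smoothing
        prob *= bigrams[(a, b)] / unigrams[a]
    return prob
```

A decoder would use such scores (in practice smoothed N-grams or neural models) to prefer acoustically plausible hypotheses that are also likely word sequences.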

