Hierarchical Phoneme Classification for Improved Speech Recognition

2021 ◽  
Vol 11 (1) ◽  
pp. 428
Author(s):  
Donghoon Oh ◽  
Jeong-Sik Park ◽  
Ji-Hwan Kim ◽  
Gil-Jin Jang

Speech recognition consists of converting input sound into a sequence of phonemes and then finding text for the input using language models. Phoneme classification performance is therefore a critical factor in the successful implementation of a speech recognition system. However, correctly distinguishing phonemes with similar characteristics remains a challenging problem even for state-of-the-art classification methods, and classification errors are hard to recover from in the subsequent language processing steps. This paper proposes a hierarchical phoneme clustering method that applies more suitable recognition models to different phonemes. The phonemes of the TIMIT database are carefully analyzed using a confusion matrix from a baseline speech recognition model. Based on the automatic phoneme clustering results, a set of phoneme classification models optimized for the generated phoneme groups is constructed and integrated into a hierarchical phoneme classification method. In a number of phoneme classification experiments, the proposed hierarchical phoneme group models improved performance over the baseline by 3%, 2.1%, 6.0%, and 2.2% for fricative, affricate, stop, and nasal sounds, respectively. The average accuracy was 69.5% for the baseline and 71.7% for the proposed hierarchical models, a 2.2% overall improvement.
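The grouping step described above can be illustrated with a small sketch. This is a hypothetical simplification, not the authors' exact algorithm: phonemes whose mutual confusion rate (read off a confusion matrix) exceeds a threshold are merged into one group, so that a specialized classifier could then be trained per group.

```python
# Hypothetical sketch of confusion-driven phoneme grouping: merge phonemes
# that a baseline recognizer frequently confuses with each other.

def confusion_groups(phonemes, confusion, threshold):
    """Group phonemes whose symmetric confusion rate exceeds `threshold`.

    confusion[i][j] is the fraction of times phoneme i was recognized as j.
    Uses a simple union-find over pairs with high average confusion.
    """
    parent = list(range(len(phonemes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    n = len(phonemes)
    for i in range(n):
        for j in range(i + 1, n):
            # symmetric confusion: average of both directions
            if (confusion[i][j] + confusion[j][i]) / 2 >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, p in enumerate(phonemes):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())
```

With a toy 4-phoneme confusion matrix, frequently confused fricatives such as /s/ and /z/ end up in one group while distinct phonemes stay separate.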

Author(s):  
Mridusmita Sharma ◽  
Kandarpa Kumar Sarma

Speech is the natural means of communication, but it is not the typical input means afforded by computers. Interaction between humans and machines would be easier if speech were an effective alternative to the keyboard and mouse. With advances in signal processing and model-building techniques and the growing power of computing devices, significant progress has been made in speech recognition research, and various speech-based applications have been developed. With the rapid advancement of speech recognition technology, telephone speech technology is becoming involved in many new applications of spoken language processing. The literature shows that spectro-temporal features give a significant performance improvement for telephone speech recognition systems in comparison with the robust feature techniques otherwise used for recognition. In this chapter, the authors report on the various spectral and temporal features and the soft computing techniques that have been used for telephone speech recognition.


Author(s):  
Vanajakshi Puttaswamy Gowda ◽  
Mathivanan Murugavelu ◽  
Senthil Kumaran Thangamuthu

<p><span>Continuous speech segmentation and recognition play an important role in natural language processing. Continuous context-based Kannada speech segmentation depends on the context, grammar, and semantic rules present in the Kannada language. Extracting significant features of the Kannada speech signal for a recognition system is quite exciting for researchers. The method proposed in this paper is divided into two parts. In the first part, continuous Kannada speech is segmented with respect to context by computing the average short-term energy and the spectral centroid coefficients of the speech signal within a specified window. The segmented outputs are fully meaningful segmentations for different scenarios, with low segmentation error. The second part performs speech recognition by extracting a small number of Mel-frequency cepstral coefficients and using vector quantization with a small number of codebooks. Recognition is based entirely on a threshold value. Setting this threshold is a challenging task, but a simple method is used to achieve a good recognition rate. The experimental results show more efficient and effective segmentation, with a higher recognition rate than existing methods, for continuous context-based Kannada speech with different male and female accents, while using minimal feature dimensions for the training data.</span></p>
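The two frame-level features driving the segmentation step, average short-term energy and spectral centroid, can be sketched as follows. This is a minimal illustration under assumed frame and hop sizes, not the authors' implementation:

```python
# Frame-level feature sketch: average short-term energy and spectral centroid.
import numpy as np

def frame_features(signal, sr, frame_len=400, hop=200):
    """Return (energy, centroid) arrays, one value per analysis frame."""
    energies, centroids = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(np.mean(frame ** 2))             # average short-term energy
        spectrum = np.abs(np.fft.rfft(frame))            # magnitude spectrum
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)   # bin center frequencies
        # spectral centroid: magnitude-weighted mean frequency
        centroids.append(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12))
    return np.array(energies), np.array(centroids)
```

On a signal that starts with silence and ends with a 1 kHz tone, the energy rises at the tone onset and the centroid sits near 1000 Hz, which is the kind of contrast a segmenter can threshold on.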


Author(s):  
MIRJAM SEPESY MAUČEC ◽  
TOMAŽ ROTOVNIK ◽  
ZDRAVKO KAČIČ ◽  
JANEZ BREST

This paper presents the results of a study on modeling the highly inflective Slovenian language. We focus on creating a language model for a large-vocabulary speech recognition system. A new data-driven method is proposed for the induction of inflectional morphology into language modeling. The research focus is on data sparsity, which results from the complex morphology of the language. The idea of using subword units is examined. An attempt is made to determine the segmentation of words into two subword units: stems and endings. No prior knowledge of the language is used. The subword units should fit into the frameworks of probabilistic language models. A morphologically correct decomposition of words is not sought; instead, we search for a decomposition which yields the minimum entropy of the training corpus. This entropy is approximated using N-gram models. Despite some seemingly over-simplified assumptions, the subword models improve the applicability of the language models for a sparse training corpus. The experiments were performed using the VEČER newswire text corpus for training. The test set was taken from the SNABI speech database, because the final models were evaluated in speech recognition experiments on that database. Two different subword-based models are proposed and examined experimentally. The experiments demonstrate that subword-based models which considerably reduce the OOV rate improve speech recognition WER compared with standard word-based models, even though they increase test set perplexity. Subword-based models with improved perplexity, but which reduce the OOV rate much less than the previous ones, do not improve speech recognition results.
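The minimum-entropy criterion above can be made concrete with a toy sketch. This is an assumed simplification (unigram entropy rather than the paper's N-gram approximation, and fixed candidate cut points): splitting inflected forms into a shared stem plus short endings lowers the entropy of the unit stream.

```python
# Toy illustration of the entropy criterion for stem+ending decomposition.
import math
from collections import Counter

def unigram_entropy(units):
    """Unigram (maximum-likelihood) entropy of a sequence of units, in bits."""
    counts = Counter(units)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def decompose(corpus, splits):
    """splits maps a word to a cut position; unsplit words pass through whole."""
    units = []
    for word in corpus:
        if word in splits:
            k = splits[word]
            units.extend([word[:k], word[k:]])
        else:
            units.append(word)
    return units
```

For three hypothetical inflected forms sharing the stem "dela", cutting off the endings yields a lower-entropy unit stream than keeping whole words, which is the direction the search criterion rewards.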


2020 ◽  
Author(s):  
Dezhou Shen

Abstract Image classification and categorization are essential to a machine's ability to tell images apart. As Bidirectional Encoder Representations from Transformers (BERT) has become popular in many natural language processing tasks in recent years, it is intuitive to use such pre-trained language models to enhance computer vision tasks, e.g. image classification. In this paper, by encoding image pixels with pre-trained transformers and connecting the result to a fully connected layer, the classification model outperforms the Wide ResNet model and the linear-probe iGPT-L model, achieving accuracy of 99.60%~99.74% on the CIFAR-10 image set and 99.10%~99.76% on the CIFAR-100 image set.
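The data flow of the described pipeline, pixel encoding followed by a fully connected classification head, can be sketched schematically. This is only an assumed stand-in: a frozen random embedding table with mean pooling replaces the actual pre-trained transformer so the shapes are visible; it is not the paper's model.

```python
# Schematic pipeline: "encode" pixel tokens, pool, classify with an FC layer.
import numpy as np

rng = np.random.default_rng(0)

def classify(images, embed, weights, bias):
    """images: (batch, n_pixels) integer pixel values in [0, 255]."""
    tokens = embed[images]                  # (batch, n_pixels, d) stand-in encoding
    pooled = tokens.mean(axis=1)            # (batch, d) sequence representation
    logits = pooled @ weights + bias        # fully connected classification head
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)  # softmax class probabilities
```

In the real system the embedding and pooling would come from the pre-trained transformer, and only the head (or the whole stack) would be trained on CIFAR labels.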


Author(s):  
Lori Lamel ◽  
Jean-Luc Gauvain

Speech recognition is concerned with converting the speech waveform, an acoustic signal, into a sequence of words. Today's approaches are based on statistical modelling of the speech signal. This article provides an overview of the main topics addressed in speech recognition: acoustic-phonetic modelling, lexical representation, language modelling, decoding, and model adaptation. Language models are used in speech recognition to estimate the probability of word sequences. The main components of a generic speech recognition system are the main knowledge sources, feature analysis, the acoustic and language models, which are estimated in a training phase, and the decoder. The focus of this article is on methods used in state-of-the-art speaker-independent, large-vocabulary continuous speech recognition (LVCSR). Primary application areas for such technology are dictation, spoken language dialogue, and transcription for information archival and retrieval systems. Finally, this article discusses issues and directions of future research.
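The language-model role described above, estimating the probability of word sequences, can be shown with a toy bigram model. This example is not from the article; it uses plain maximum-likelihood counts with no smoothing:

```python
# Toy bigram language model: P(w1..wn) ~ product of P(w_i | w_{i-1}).
from collections import Counter

def bigram_probability(sentence, corpus):
    """Maximum-likelihood bigram probability of `sentence` given `corpus`."""
    bigrams, unigrams = Counter(), Counter()
    for sent in corpus:
        words = ["<s>"] + sent.split()      # <s> marks the sentence start
        unigrams.update(words)
        for a, b in zip(words, words[1:]):
            bigrams[(a, b)] += 1
    prob = 1.0
    words = ["<s>"] + sentence.split()
    for a, b in zip(words, words[1:]):
        if bigrams[(a, b)] == 0:
            return 0.0                      # unseen bigram: zero without smoothing
        prob *= bigrams[(a, b)] / unigrams[a]
    return prob
```

A decoder would use such scores (in practice smoothed N-grams or neural models) to prefer acoustically plausible hypotheses that are also likely word sequences.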

