Adaptive feature truncation to address acoustic mismatch in automatic recognition of children's speech

APSIPA Transactions on Signal and Information Processing ◽

10.1017/atsip.2016.16 ◽

2016 ◽

Vol 5 ◽

Author(s):

Shweta Ghai ◽

Rohit Sinha

Keyword(s):

Speech Recognition ◽

Vocal Tract ◽

Recognition Task ◽

Automatic Recognition ◽

Continuous Speech Recognition ◽

Mel Frequency Cepstral Coefficients ◽

Acoustic Mismatch ◽

Constrained Maximum Likelihood ◽

Maximum Likelihood Linear Regression ◽

Children's Speech

An algorithm for adaptive Mel frequency cepstral coefficients (MFCC) feature truncation is proposed to improve automatic speech recognition (ASR) performance under acoustically mismatched conditions. Using the relationship found between MFCC base feature truncation and degree of acoustic mismatch of speech signals with respect to recognition models, the proposed algorithm performs utterance-specific MFCC feature truncation for test signals to address their acoustic mismatch in context of ASR. The proposed technique, without any prior knowledge about the speaker of the test utterance, gives 38% (on a connected-digit recognition task) and 36% (on a continuous speech recognition task) relative improvement over baseline in ASR performance for children's speech on models trained on adult speech, which is also found to be additive to improvements obtained with vocal tract length normalization and/or constrained maximum likelihood linear regression. The generality and effectiveness of the algorithm is also validated for automatic recognition of children's and adults' speech under matched and mismatched conditions.

Download Full-text

Adaptive training with noisy constrained maximum likelihood linear regression for noise robust speech recognition

10.21437/interspeech.2009-364 ◽

2009 ◽

Author(s):

D. K. Kim ◽

M. J. F. Gales

Keyword(s):

Speech Recognition ◽

Maximum Likelihood ◽

Linear Regression ◽

Robust Speech Recognition ◽

Adaptive Training ◽

Noise Robust Speech Recognition ◽

Noise Robust ◽

Constrained Maximum Likelihood ◽

Maximum Likelihood Linear Regression

Download Full-text

Regularized constrained maximum likelihood linear regression for speech recognition

2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp.2014.6854820 ◽

2014 ◽

Author(s):

Sina Hamidi Ghalehjegh ◽

Richard C. Rose

Keyword(s):

Speech Recognition ◽

Maximum Likelihood ◽

Linear Regression ◽

Constrained Maximum Likelihood ◽

Maximum Likelihood Linear Regression

Download Full-text

Creation and Instigation of Triphone based Big-Lexicon Speaker-Independent Continuous Speech Recognition Framework for Kannada Language

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.b1090.1292s19 ◽

2019 ◽

Vol 9 (2S) ◽

pp. 152-158

Keyword(s):

Speech Recognition ◽

Recognition Rate ◽

Continuous Speech ◽

Continuous Speech Recognition ◽

Mel Frequency Cepstral Coefficients ◽

Speech Corpus ◽

Linear Discriminant ◽

Geographical Regions ◽

Speech Data ◽

Speech Information

This paper proposes a framework that is intended to do the comparably accurate recognition of speech and in precise, continuous speech recognition (CSR) based on triphone modelling for Kannada dialect. For designing the proposed framework, the features from the speech data are obtained from the well-known feature extraction technique Mel-frequency cepstral coefficients (MFCC) and from its transformations, like, linear discriminant analysis (LDA) and maximum likelihood linear transforms (MLLT) are obtained from Kannada speech data files. At that point, the system is trained to evaluate the hidden Markov model (HMM) parameters for continuous speech (CS) data. The persistent Kannada speech information is gathered from 2600 speakers (1560 men and 1040women) of the age bunch in the scope of 14 years-80 years. The speech information is acquired from different geographical regions of the Karnataka (one of the 29 states situated in the southern part of India) state under degraded condition. It comprises of 21,551 words that spread 30 locales. The performance evaluation of both monophone and triphone models concerning word error rate (WER) is done and the obtained results are compared with the standard databases such as TIMIT and aurora4. A significant reduction in WER is obtained for triphone models. The speech recognition (SR) rate is verified for both offline and online recognition mode for all the speakers. The results reveal that the recognition rate (RR) for Kannada speech corpus has got a better improvement over the state-of-the-art existing databases.

Download Full-text

Features Extraction for Lhasa Tibetan Speech Recognition

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.571-572.205 ◽

2014 ◽

Vol 571-572 ◽

pp. 205-208

Author(s):

Guan Yu Li ◽

Hong Zhi Yu ◽

Yong Hong Li ◽

Ning Ma

Keyword(s):

Speech Recognition ◽

Linear Prediction ◽

Recognition System ◽

Continuous Speech Recognition ◽

Mel Frequency Cepstral Coefficients ◽

Linear Prediction Coefficient ◽

Speech Feature ◽

Perceptual Linear Prediction ◽

Prediction Coefficient ◽

Speech Feature Extraction

Speech feature extraction is discussed. Mel frequency cepstral coefficients (MFCC) and perceptual linear prediction coefficient (PLP) method is analyzed. These two types of features are extracted in Lhasa large vocabulary continuous speech recognition system. Then the recognition results are compared.

Download Full-text

A Hybrid of Deep CNN and Bidirectional LSTM for Automatic Speech Recognition

Journal of Intelligent Systems ◽

10.1515/jisys-2018-0372 ◽

2019 ◽

Vol 29 (1) ◽

pp. 1261-1274 ◽

Cited By ~ 5

Author(s):

Vishal Passricha ◽

Rajesh Kumar Aggarwal

Keyword(s):

Neural Networks ◽

Speech Recognition ◽

Speech Signal ◽

Short Term Memory ◽

Recognition Rate ◽

Recognition Task ◽

Acoustic Modeling ◽

Hybrid Architecture ◽

Continuous Speech Recognition ◽

Temporal Properties

Abstract Deep neural networks (DNNs) have been playing a significant role in acoustic modeling. Convolutional neural networks (CNNs) are the advanced version of DNNs that achieve 4–12% relative gain in the word error rate (WER) over DNNs. Existence of spectral variations and local correlations in speech signal makes CNNs more capable of speech recognition. Recently, it has been demonstrated that bidirectional long short-term memory (BLSTM) produces higher recognition rate in acoustic modeling because they are adequate to reinforce higher-level representations of acoustic data. Spatial and temporal properties of the speech signal are essential for high recognition rate, so the concept of combining two different networks came into mind. In this paper, a hybrid architecture of CNN-BLSTM is proposed to appropriately use these properties and to improve the continuous speech recognition task. Further, we explore different methods like weight sharing, the appropriate number of hidden units, and ideal pooling strategy for CNN to achieve a high recognition rate. Specifically, the focus is also on how many BLSTM layers are effective. This paper also attempts to overcome another shortcoming of CNN, i.e. speaker-adapted features, which are not possible to be directly modeled in CNN. Next, various non-linearities with or without dropout are analyzed for speech tasks. Experiments indicate that proposed hybrid architecture with speaker-adapted features and maxout non-linearity with dropout idea shows 5.8% and 10% relative decrease in WER over the CNN and DNN systems, respectively.

Download Full-text

Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition

IEEE Transactions on Speech and Audio Processing ◽

10.1109/89.784104 ◽

1999 ◽

Vol 7 (5) ◽

pp. 525-532 ◽

Cited By ~ 96

Author(s):

R. Vergin ◽

D. O'Shaughnessy ◽

A. Farhat

Keyword(s):

Speech Recognition ◽

Continuous Speech ◽

Continuous Speech Recognition ◽

Mel Frequency Cepstral Coefficients ◽

Large Vocabulary ◽

Speaker Independent ◽

Cepstral Coefficients

Download Full-text

Noisy Constrained Maximum-Likelihood Linear Regression for Noise-Robust Speech Recognition

IEEE Transactions on Audio Speech and Language Processing ◽

10.1109/tasl.2010.2047756 ◽

2011 ◽

Vol 19 (2) ◽

pp. 315-325 ◽

Cited By ~ 24

Author(s):

D. K. Kim ◽

M. J. F. Gales

Keyword(s):

Speech Recognition ◽

Maximum Likelihood ◽

Linear Regression ◽

Robust Speech Recognition ◽

Noise Robust Speech Recognition ◽

Noise Robust ◽

Constrained Maximum Likelihood ◽

Maximum Likelihood Linear Regression

Download Full-text

Deep-neural network approaches for speech recognition with heterogeneous groups of speakers including children

Natural Language Engineering ◽

10.1017/s135132491600005x ◽

2016 ◽

Vol 23 (3) ◽

pp. 325-350 ◽

Cited By ~ 15

Author(s):

ROMAIN SERIZEL ◽

DIEGO GIULIANI

Keyword(s):

Neural Network ◽

Speech Recognition ◽

Error Rate ◽

Deep Neural Network ◽

Vocal Tract ◽

Rate Performance ◽

Posterior Probabilities ◽

Mel Frequency Cepstral Coefficients ◽

Heterogeneous Groups ◽

Cepstral Coefficients

AbstractThis paper introduces deep neural network (DNN)–hidden Markov model (HMM)-based methods to tackle speech recognition in heterogeneous groups of speakers including children. We target three speaker groups consisting of children, adult males and adult females. Two different kind of approaches are introduced here: approaches based on DNN adaptation and approaches relying on vocal-tract length normalisation (VTLN). First, the recent approach that consists in adapting a general DNN to domain/language specific data is extended to target age/gender groups in the context of DNN–HMM. Then, VTLN is investigated by training a DNN–HMM system by using either mel frequency cepstral coefficients normalised with standard VTLN or mel frequency cepstral coefficients derived acoustic features combined with the posterior probabilities of the VTLN warping factors. In this later, novel, approach the posterior probabilities of the warping factors are obtained with a separate DNN and the decoding can be operated in a single pass when the VTLN approach requires two decoding passes. Finally, the different approaches presented here are combined to take advantage of their complementarity. The combination of several approaches is shown to improve the baseline phone error rate performance by thirty per cent to thirty-five per cent relative and the baseline word error rate performance by about ten per cent relative.

Download Full-text

USING NONLINEAR MODELING OF RECONSTRUCTED PHASE SPACE AND FREQUENCY DOMAIN ANALYSIS TO IMPROVE AUTOMATIC SPEECH RECOGNITION PERFORMANCE

International Journal of Bifurcation and Chaos ◽

10.1142/s0218127412500538 ◽

2012 ◽

Vol 22 (03) ◽

pp. 1250053 ◽

Cited By ~ 8

Author(s):

AYYOOB JAFARI ◽

FARSHAD ALMASGANJ

Keyword(s):

Speech Recognition ◽

Phase Space ◽

Recognition Performance ◽

Nonlinear Modeling ◽

Gaussian Mixture ◽

Continuous Speech ◽

Continuous Speech Recognition ◽

Mel Frequency Cepstral Coefficients ◽

Speech Database ◽

Reconstructed Phase Space

This paper introduces a combinational feature extraction approach to improve speech recognition systems. The main idea is to simultaneously benefit from some features obtained from nonlinear modeling applied to speech reconstructed phase space (RPS) and typical Mel frequency Cepstral coefficients (MFCCs) which have a proved role in speech recognition field. With an appropriate dimension, the reconstructed phase space of speech signal is assured to be topologically equivalent to the dynamics of the speech production system, and could therefore include information that may be absent in linear analysis approaches. In the first part of this paper the application of Lyapunov Exponents (LE) and Fractal Dimension as two usual chaotic features in speech recognition are tested and then a short discussion is made on the weakness of these features in speech recognition. In the following, a statistical modeling approach based on Gaussian mixture models (GMMs) is applied to speech RPS. A final pruned feature set is obtained by applying an efficient feature selection approach to the combination of the parameters of the GMM model and MFCC-based features. A hidden Markov model-based (HMM) speech recognition system and TIMIT speech database are used to evaluate the performance of the proposed feature set by conducting isolated and continuous speech recognition experiments. In final Continuous Speech Recognition (CSR) Experiments, using tri-phone models, 3.7% absolute phoneme recognition accuracy improvement against using MFCC features alone were obtained.

Download Full-text

A Method of Speech Coding for Speech Recognition Using a Convolutional Neural Network

Symmetry ◽

10.3390/sym11091185 ◽

2019 ◽

Vol 11 (9) ◽

pp. 1185 ◽

Cited By ~ 4

Author(s):

Mariusz Kubanek ◽

Janusz Bobulski ◽

Joanna Kulawik

Keyword(s):

Neural Network ◽

Neural Networks ◽

Speech Recognition ◽

Network Structure ◽

High Resistance ◽

Vocal Tract ◽

Continuous Speech ◽

Mel Frequency Cepstral Coefficients ◽

New Approach ◽

Neural Network Structure

This work presents a new approach to speech recognition, based on the specific coding of time and frequency characteristics of speech. The research proposed the use of convolutional neural networks because, as we know, they show high resistance to cross-spectral distortions and differences in the length of the vocal tract. Until now, two layers of time convolution and frequency convolution were used. A novel idea is to weave three separate convolution layers: traditional time convolution and the introduction of two different frequency convolutions (mel-frequency cepstral coefficients (MFCC) convolution and spectrum convolution). This application takes into account more details contained in the tested signal. Our idea assumes creating patterns for sounds in the form of RGB (Red, Green, Blue) images. The work carried out research for isolated words and continuous speech, for neural network structure. A method for dividing continuous speech into syllables has been proposed. This method can be used for symmetrical stereo sound.

Download Full-text