Automatic Language Identification Using Speech Rhythm Features for Multi-Lingual Speech Recognition

2020 ◽  
Vol 10 (7) ◽  
pp. 2225
Author(s):  
Hwamin Kim ◽  
Jeong-Sik Park

Conventional speech recognition systems can handle input speech in only a single, specific language. To realize multi-lingual speech recognition, the language of the input speech must first be identified. This study proposes an efficient Language IDentification (LID) approach for multi-lingual systems. Standard LID systems rely on the common acoustic features used in speech recognition. However, these features may convey insufficient language-specific information, as they are designed to discriminate general phonemic content. This study investigates another type of feature that characterizes language-specific properties while keeping computational complexity low. We focus on speech rhythm features, which capture the prosodic characteristics of speech signals. Rhythm features reflect how consonants and vowels pattern within a language, so consonantal and vocalic segments must first be classified from the speech signal. For rapid classification, we employ Gaussian Mixture Model (GMM)-based learning, in which two GMMs corresponding to consonants and vowels are trained and then used to classify each frame. From the classification results, we estimate the durations of consonantal and vocalic intervals and compute rhythm metrics collected into a feature vector called the R-vector. In experiments on several speech corpora, the automatically extracted R-vectors showed language tendencies similar to those reported in the linguistics literature. In addition, the proposed R-vector-based LID approach achieved LID performance superior or comparable to conventional approaches despite its low computational complexity.
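To make the pipeline concrete, here is a minimal sketch of the rhythm-metric computation described above, assuming frame-level acoustic features (e.g., MFCCs) and the classic %V/ΔC/ΔV metrics of Ramus et al.; the exact features, GMM configuration, and R-vector composition are assumptions for illustration, not the paper's specification.

```python
import numpy as np
from itertools import groupby
from sklearn.mixture import GaussianMixture

def train_phoneme_gmms(consonant_frames, vowel_frames, n_components=8):
    """Fit one GMM per phonemic group on labeled acoustic frames
    (the feature type, e.g. MFCCs, is an assumption for illustration)."""
    gmm_c = GaussianMixture(n_components=n_components).fit(consonant_frames)
    gmm_v = GaussianMixture(n_components=n_components).fit(vowel_frames)
    return gmm_c, gmm_v

def r_vector(frames, gmm_c, gmm_v, frame_shift=0.01):
    """Classify each frame as consonantal or vocalic, group consecutive
    frames into intervals, and compute rhythm metrics over the intervals."""
    is_vowel = gmm_v.score_samples(frames) > gmm_c.score_samples(frames)
    durations = {True: [], False: []}
    for label, run in groupby(is_vowel):
        durations[bool(label)].append(sum(1 for _ in run) * frame_shift)
    voc = np.asarray(durations[True])
    con = np.asarray(durations[False])
    pct_v = voc.sum() / (voc.sum() + con.sum())     # %V: vocalic proportion
    return np.array([pct_v, con.std(), voc.std()])  # [%V, deltaC, deltaV]
```

The resulting three-dimensional vector is a minimal stand-in for the R-vector; a language classifier can then be trained on such vectors extracted per utterance.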

2014 ◽  
Vol 25 (02) ◽  
pp. 141-153 ◽  
Author(s):  
Yu-Hsiang Wu ◽  
Elizabeth Stangl ◽  
Carol Pang ◽  
Xuyang Zhang

Background: Little is known regarding the acoustic features of a stimulus used by listeners to determine the acceptable noise level (ANL). Features suggested by previous research include speech intelligibility (noise is unacceptable when it degrades speech intelligibility to a certain degree; the intelligibility hypothesis) and loudness (noise is unacceptable when the speech-to-noise loudness ratio is poorer than a certain level; the loudness hypothesis). Purpose: The purpose of the study was to investigate whether speech intelligibility or loudness is the criterion feature that determines ANL. To achieve this, test conditions were chosen so that the intelligibility and loudness hypotheses would predict different results. In Experiment 1, the effect of audiovisual (AV) and binaural listening on ANL was investigated; in Experiment 2, the effect of interaural correlation (ρ) on ANL was examined. Research Design: A single-blinded, repeated-measures design was used. Study Sample: Thirty-two and twenty-five younger adults with normal hearing participated in Experiments 1 and 2, respectively. Data Collection and Analysis: In Experiment 1, both ANL and speech recognition performance were measured using the AV version of the Connected Speech Test (CST) in three conditions: AV-binaural, auditory-only (AO)-binaural, and AO-monaural. Lipreading skill was assessed using the Utley lipreading test. In Experiment 2, ANL and speech recognition performance were measured using the Hearing in Noise Test (HINT) in three binaural conditions, wherein the interaural correlation of the noise was varied: ρ = 1 (NoSo; both speech and noise signals are identical across the two ears), ρ = −1 (NπSo; speech signals are identical across the two ears whereas the noise signals are 180 degrees out of phase), and ρ = 0 (NuSo; speech signals are identical across the two ears whereas the noise signals are uncorrelated across ears). The results were compared to the predictions made by the intelligibility and loudness hypotheses. Results: The results of the AV and AO conditions appeared to support the intelligibility hypothesis, owing to the significant correlation between visual benefit in ANL (AV re: AO ANL) and (1) visual benefit in CST performance (AV re: AO CST) and (2) lipreading skill. The results of the NoSo, NπSo, and NuSo conditions negated the intelligibility hypothesis because the binaural processing benefit (NπSo re: NoSo, and NuSo re: NoSo) in ANL was not correlated with that in HINT performance. Instead, the results somewhat supported the loudness hypothesis because the pattern of ANL results across the three conditions (NoSo ≈ NπSo ≈ NuSo ANL) was more consistent with the prediction of the loudness hypothesis (NoSo ≈ NπSo < NuSo ANL) than with that of the intelligibility hypothesis (NπSo < NuSo < NoSo ANL). The results of the binaural and monaural conditions supported neither hypothesis because (1) the binaural benefit (binaural re: monaural) in ANL was not correlated with that in speech recognition performance, and (2) the pattern of ANL results across conditions (binaural < monaural ANL) was not consistent with the prediction made based on previous binaural loudness summation research (binaural ≥ monaural ANL). Conclusions: The study suggests that listeners may use multiple acoustic features to make ANL judgments. The binaural/monaural results, which supported neither hypothesis, further indicate that factors other than speech intelligibility and loudness, such as psychological factors, may affect ANL. The weightings of different acoustic features in ANL judgments may vary widely across individuals and listening conditions.
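As a concrete illustration of the benefit-correlation analyses described above, the sketch below computes a per-participant visual benefit in ANL and in CST performance and tests their association. The function name, data layout, sign conventions, and use of a Pearson correlation are all assumptions for illustration; the abstract does not specify the study's statistical pipeline.

```python
import numpy as np
from scipy.stats import pearsonr

def visual_benefit_correlation(anl_av, anl_ao, cst_av, cst_ao):
    """Correlate visual benefit in ANL (AV re: AO) with visual benefit
    in CST performance, one value per participant (hypothetical arrays)."""
    # Sign conventions are assumptions: a lower ANL with visual cues and a
    # higher CST score with visual cues are both treated as benefit.
    benefit_anl = np.asarray(anl_ao) - np.asarray(anl_av)
    benefit_cst = np.asarray(cst_av) - np.asarray(cst_ao)
    return pearsonr(benefit_anl, benefit_cst)  # (r, p-value)
```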


Author(s):  
Yuki Takashima ◽  
Toru Nakashika ◽  
Tetsuya Takiguchi ◽  
Yasuo Ariki

Voice conversion (VC) is a technique for converting speaker-specific information in the source speech while preserving the associated phonemic information. Non-negative matrix factorization (NMF)-based VC has been widely researched because it achieves a more natural-sounding voice than conventional Gaussian mixture model-based VC. In conventional NMF-VC, models are trained on parallel data, which requires elaborate pre-processing of the speech data to construct the parallel corpus. NMF-VC also tends to be a large model, as the dictionary matrix holds many parallel exemplars, leading to high computational cost. In this study, an innovative dictionary-learning method using non-negative Tucker decomposition (NTD) is proposed that does not require parallel data. The method decomposes an input observation into a set of mode matrices and one core tensor, and uses this decomposition to estimate the dictionary matrix for NMF-VC without parallel data. The experimental results show that the proposed method outperforms other methods in both parallel and non-parallel settings.
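A minimal sketch of the NTD step, using the tensorly library; the library choice, tensor layout, ranks, and the role assigned to each factor are assumptions for illustration, and the paper's exact formulation may differ. A stack of non-negative spectrogram patches is decomposed into one core tensor and one non-negative factor matrix per mode, with the frequency-mode factor playing the role of an NMF-style spectral dictionary.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_tucker

# Hypothetical input: a third-order tensor of non-negative spectrogram
# patches, shaped (n_patches, n_freq_bins, n_frames_per_patch).
patches = np.abs(np.random.randn(200, 257, 20))
tensor = tl.tensor(patches)

# Non-negative Tucker decomposition: one core tensor plus one
# non-negative factor matrix per mode.
core, factors = non_negative_tucker(tensor, rank=[40, 60, 10], n_iter_max=200)
patch_factor, freq_factor, time_factor = factors

# Under this (assumed) layout, the frequency-mode factor (257 x 60) can
# serve as a spectral dictionary for NMF-style activation estimation.
print(freq_factor.shape)
```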


2011 ◽  
Vol 25 (2) ◽  
pp. 404-439 ◽  
Author(s):  
Daniel Povey ◽  
Lukáš Burget ◽  
Mohit Agarwal ◽  
Pinar Akyazi ◽  
Feng Kai ◽  
...  

2010 ◽  
Vol 72 (6) ◽  
pp. 1601-1613 ◽  
Author(s):  
Rebecca E. Ronquest ◽  
Susannah V. Levi ◽  
David B. Pisoni

2000 ◽  
Vol 12 (6) ◽  
pp. 1411-1427 ◽  
Author(s):  
Shotaro Akaho ◽  
Hilbert J. Kappen

Theories of learning and generalization hold that the generalization bias, defined as the difference between the training error and the generalization error, increases on average with the number of adaptive parameters. This article, however, shows that this general tendency is violated for a Gaussian mixture model. For temperatures just below the first symmetry-breaking point, the effective number of adaptive parameters increases and the generalization bias decreases. We compute the dependence of the neural information criterion on temperature around the symmetry breaking. Our results are confirmed by numerical cross-validation experiments.
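For context, the quantity at stake can be written as a standard asymptotic expansion (a textbook-style sketch; the authors' exact formulation around the symmetry-breaking point may differ). Information criteria of the AIC/NIC family estimate the generalization bias through an effective number of parameters:

```latex
% Generalization bias: expected gap between generalization and training
% error. For N training samples, the standard asymptotic result is
\[
  b \;=\; \mathbb{E}\!\left[E_{\mathrm{gen}}\right]
        - \mathbb{E}\!\left[E_{\mathrm{train}}\right]
    \;\approx\; \frac{k^{*}}{N},
  \qquad
  k^{*} \;=\; \operatorname{tr}\!\left(G\,Q^{-1}\right),
\]
% where Q is the Hessian of the expected loss and G the covariance of the
% per-sample loss gradient. For a regular model, k* reduces to the raw
% number of adaptive parameters, giving the usual monotone tendency that
% the abstract reports is violated near the first symmetry-breaking point.
```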


Author(s):  
Aye Nyein Mon ◽  
Win Pa Pa ◽  
Ye Kyaw Thu

This paper introduces a speech corpus developed for Myanmar Automatic Speech Recognition (ASR) research. ASR research is conducted by researchers around the world to improve their language technologies. Speech corpora are essential for developing ASR systems, and creating them is especially necessary for low-resourced languages. Myanmar can be regarded as a low-resourced language because it lacks pre-existing resources for speech processing research. In this work, a speech corpus named UCSY-SC1 (University of Computer Studies Yangon - Speech Corpus1) is created for Myanmar ASR research. The corpus covers two domains, news and daily conversations, and totals over 42 hours of speech: 25 hours of web news and 17 hours of recorded conversational data. The news data were collected from 177 female and 84 male speakers, and the conversational data from 42 female and 4 male speakers. This corpus was used as training data for developing Myanmar ASR. Three types of acoustic models were built and compared: Gaussian Mixture Model (GMM) - Hidden Markov Model (HMM), Deep Neural Network (DNN), and Convolutional Neural Network (CNN) models. Experiments were conducted on different data sizes, and evaluation used two test sets: TestSet1 (web news) and TestSet2 (recorded conversational data). Myanmar ASR systems trained on this corpus gave satisfactory results on both test sets, with word error rates of 15.61% on TestSet1 and 24.43% on TestSet2.
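The word error rates quoted above are standard edit-distance scores; as a reference point, here is a minimal sketch of how WER is conventionally computed (the usual Levenshtein formulation, not code from the paper).

```python
import numpy as np

def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance divided by the
    number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i, j] = edit distance between ref[:i] and hyp[:j].
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[len(ref), len(hyp)] / len(ref)

# Example: one substitution and one deletion over six reference words,
# giving WER = 2/6.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```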

