Robust Feature Vector Set Using Higher Order Autocorrelation Coefficients

Author(s):  
Poonam Bansal ◽  
Amita Dev ◽  
Shail Jain

In this paper, a feature extraction method that is robust to additive background noise is proposed for automatic speech recognition. Since background noise corrupts the autocorrelation coefficients of the speech signal mostly at the lower orders, while the higher-order autocorrelation coefficients are least affected, this method discards the lower-order autocorrelation coefficients and uses only the higher-order coefficients for spectral estimation. The magnitude spectrum of the windowed higher-order autocorrelation sequence is used as an estimate of the power spectrum of the speech signal. This power spectral estimate is processed further by the Mel filter bank, a log operation, and the discrete cosine transform to obtain the cepstral coefficients. These cepstral coefficients are referred to as the Differentiated Relative Higher Order Autocorrelation Coefficient Sequence Spectrum (DRHOASS). The authors evaluate the speech recognition performance of the DRHOASS features and show that they perform as well as MFCC features for clean speech and better than MFCC features for noisy speech.
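A rough sketch of this front end for a single speech frame is given below; the number of discarded lower-order lags, the FFT size, the filter-bank size, and the cepstral order are illustrative assumptions rather than the authors' settings, and the triangular Mel filter bank is a generic one.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_mel, n_fft, fs):
    # Generic triangular filters spaced uniformly on the Mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(0.0), mel(fs / 2.0), n_mel + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_mel, n_fft // 2 + 1))
    for i in range(1, n_mel + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def higher_order_autocorr_cepstra(frame, fs, drop_lags=15,
                                  n_fft=512, n_mel=24, n_ceps=13):
    # One-sided autocorrelation of the frame (lags 0 .. N-1).
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Discard the noise-dominated lower-order lags (count is illustrative).
    r_high = r[drop_lags:]
    # Window the retained higher-order autocorrelation sequence.
    r_high = r_high * np.hamming(len(r_high))
    # Magnitude spectrum of the windowed sequence as the power-spectrum estimate.
    spec = np.abs(np.fft.rfft(r_high, n_fft))
    # Mel filter bank -> log -> DCT yields the cepstral coefficients.
    log_mel = np.log(mel_filterbank(n_mel, n_fft, fs) @ spec + 1e-10)
    return dct(log_mel, type=2, norm="ortho")[:n_ceps]
```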


Author(s):  
Kai Zhao ◽  
Dan Wang

To address the problem of low recognition rates in speech recognition methods, a speech recognition method for a multi-layer perceptual network environment is proposed. In this environment, the speech signal is processed by a filter using the filter's transfer function. The signal is then windowed and framed to remove its silent segments. At the same time, the average energy and zero-crossing rate of the speech signal are calculated to extract its features. By analyzing the principles of speech signal recognition, the recognition process is designed, and speech recognition in the multi-layer perceptual network environment is realized. Experimental results show that the proposed method achieves good speech recognition performance.
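A minimal sketch of the framing, windowing, short-time energy, and zero-crossing-rate steps is shown below; the frame and hop durations are illustrative assumptions, not parameters taken from the paper.

```python
import numpy as np

def short_time_features(signal, fs, frame_ms=25, hop_ms=10):
    flen, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    win = np.hamming(flen)
    energies, zcrs = [], []
    for start in range(0, len(signal) - flen + 1, hop):
        frame = signal[start:start + flen] * win
        # Average (short-time) energy of the windowed frame.
        energies.append(np.mean(frame ** 2))
        # Zero-crossing rate: fraction of adjacent samples changing sign.
        zcrs.append(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
    return np.array(energies), np.array(zcrs)

# Frames whose energy falls below a threshold (with ZCR used to separate
# unvoiced speech from silence) can be dropped before recognition.
```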


Author(s):  
Shae D. Morgan

Purpose: Word recognition in quiet and in background noise has been thoroughly investigated in previous research to establish segmental speech recognition performance as a function of stimulus characteristics (e.g., audibility). Similar methods to investigate recognition performance for suprasegmental information (e.g., acoustic cues used to make judgments of talker age, sex, or emotional state) have not been performed. In this work, we directly compared emotion and word recognition performance in different levels of background noise to identify psychoacoustic properties of emotion recognition (globally and for specific emotion categories) relative to word recognition.

Method: Twenty young adult listeners with normal hearing listened to sentences and either reported a target word in each sentence or selected the emotion of the talker from a list of options (angry, calm, happy, and sad) at four signal-to-noise ratios in a background of white noise. Psychometric functions were fit to the recognition data and used to estimate thresholds (midway points on the function) and slopes for word and emotion recognition.

Results: Thresholds for emotion recognition were approximately 10 dB better than word recognition thresholds, and slopes for emotion recognition were half of those measured for word recognition. Low-arousal emotions had poorer thresholds and shallower slopes than high-arousal emotions, suggesting greater confusion when distinguishing low-arousal emotional speech content.

Conclusions: Communication of a talker's emotional state continues to be perceptible to listeners in competitive listening environments, even after words are rendered inaudible. The arousal of emotional speech affects listeners' ability to discriminate between emotion categories.
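For readers unfamiliar with the analysis, the sketch below fits a logistic psychometric function to proportion-correct data and reads off the threshold (the 50% midway point) and slope; the SNR values and scores are invented illustrations, not the study's data.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(snr, threshold, slope):
    # Proportion correct vs. SNR; 'threshold' is the SNR at the 50% point.
    return 1.0 / (1.0 + np.exp(-slope * (snr - threshold)))

snr = np.array([-12.0, -6.0, 0.0, 6.0])         # SNRs tested (dB), invented
p_correct = np.array([0.10, 0.35, 0.75, 0.95])  # hypothetical scores
(threshold, slope), _ = curve_fit(logistic, snr, p_correct, p0=[0.0, 0.5])
print(f"threshold: {threshold:.1f} dB SNR, slope: {slope:.2f} per dB")
```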


1988 ◽  
Vol 31 (4) ◽  
pp. 681-695 ◽  
Author(s):  
Faith C. Loven ◽  
M. Jane Collins

The purpose of this investigation was to describe the interactive effects of four signal modifications typically encountered in everyday communication settings: reverberation, masking, filtering, and fluctuation in speech intensity. The relationship between recognition performance and the spectral changes imposed on the speech signal by these alterations was also studied. The interactive effects of the modifications were evaluated by obtaining indices of nonsense-syllable recognition ability from normally hearing listeners for systematically varied combinations of the four signal parameters. The results of this study agreed with previous studies concerned with the effects of these variables in isolation on speech recognition ability. When the modifications were present in combination, the direction of each variable's effect on recognition performance was maintained; however, the magnitude of the effect increased. The results of this investigation are reasonably accounted for by a spectral theory of speech recognition.


2013 ◽  
Vol 2013 ◽  
pp. 1-11 ◽  
Author(s):  
Sanaz Seyedin ◽  
Seyed Mohammad Ahadi ◽  
Saeed Gazor

This paper presents a novel noise-robust feature extraction method for speech recognition using the robust perceptual minimum variance distortionless response (MVDR) spectrum of the temporally filtered autocorrelation sequence. The perceptual MVDR spectrum of the filtered short-time autocorrelation sequence can reduce the residual effects of nonstationary additive noise that remain after the autocorrelation filtering. To achieve a more robust front end, we also modify the robust distortionless constraint of the MVDR spectral estimation method via revised weighting of the subband power spectrum values based on the subband signal-to-noise ratios (SNRs), adjusting it to the newly proposed approach. This weighting passes the components of the input signal at the frequencies least affected by noise with larger weights and attenuates the noisy and undesired components more effectively. The modification reduces the noise residuals of the spectrum estimated from the filtered autocorrelation sequence, thereby leading to a more robust algorithm. When evaluated for recognition on the Aurora 2 task, the proposed method outperformed the Mel-frequency cepstral coefficient (MFCC) baseline, relative autocorrelation sequence MFCC (RAS-MFCC), and MVDR-based features in several noisy conditions.
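To make the MVDR step concrete, the sketch below computes a plain (unwarped) MVDR spectral estimate directly from an autocorrelation sequence, which could be the temporally filtered sequence described above; the model order and frequency grid are illustrative assumptions, and the perceptual warping and SNR-based subband weighting of the paper are omitted.

```python
import numpy as np
from scipy.linalg import toeplitz, inv

def mvdr_spectrum(r, order=12, n_freq=256):
    # (order+1) x (order+1) Toeplitz matrix from the autocorrelation lags.
    R_inv = inv(toeplitz(r[:order + 1]))
    spec = np.empty(n_freq)
    for i, w in enumerate(np.linspace(0.0, np.pi, n_freq)):
        # Steering vector e(w) = [1, e^{jw}, ..., e^{j*order*w}]^T.
        e = np.exp(1j * w * np.arange(order + 1))
        # MVDR (Capon) estimate: S(w) = 1 / (e^H R^{-1} e).
        spec[i] = 1.0 / np.real(e.conj() @ R_inv @ e)
    return spec
```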


2015 ◽  
Vol 40 (1) ◽  
pp. 25-31 ◽  
Author(s):  
Sayf A. Majeed ◽  
Hafizah Husain ◽  
Salina A. Samad

In this paper, a new feature extraction method is proposed to achieve robustness in speech recognition systems. The method combines the benefits of phase autocorrelation (PAC) with the bark wavelet transform. PAC uses the angle to measure correlation instead of the traditional dot-product autocorrelation measure, whereas the bark wavelet transform is a special type of wavelet transform designed particularly for speech signals. The features extracted by this combined method are called phase autocorrelation bark wavelet transform (PACWT) features. The speech recognition performance of the PACWT features is evaluated and compared to the conventional Mel-frequency cepstral coefficient (MFCC) features using the TI-Digits database, which is divided into male and female data, under different noise types and noise levels. The results show that the word recognition rate using the PACWT features for noisy male data (white noise at 0 dB SNR) is 60%, whereas it is 41.35% for the MFCC features under identical conditions.
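In one common PAC formulation, a circular shift preserves a frame's norm, so the angle between the frame and its shifted copy reduces to arccos(r(k)/r(0)). The sketch below computes these angles for the first few lags; the bark wavelet transform stage is omitted, and the lag count is an illustrative assumption.

```python
import numpy as np

def phase_autocorrelation(frame, max_lag=16):
    # r(0) = squared norm of the frame; circular shifts keep this norm.
    r0 = np.dot(frame, frame)
    pac = np.empty(max_lag + 1)
    for k in range(max_lag + 1):
        rk = np.dot(frame, np.roll(frame, k))   # circular autocorrelation
        # Angle between the frame and its shifted copy, clipped for safety.
        pac[k] = np.arccos(np.clip(rk / r0, -1.0, 1.0))
    return pac
```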


2005 ◽  
Vol 16 (09) ◽  
pp. 726-739 ◽  
Author(s):  
Rachel A. McArdle ◽  
Richard H. Wilson ◽  
Christopher A. Burks

The purpose of this mixed-model design was to examine recognition performance differences between listeners with normal hearing (n = 36) and listeners with hearing loss (n = 72) when measuring speech recognition in multitalker babble utilizing stimuli of varying linguistic complexity (digits, words, and sentence materials). All listeners were administered two trials of two lists of each material at descending signal-to-babble ratios. For each of the materials, recognition performance by the listeners with normal hearing was significantly better than performance by the listeners with hearing loss. The mean separation between groups at the 50% point in signal-to-babble ratio on each of the three materials was ~8 dB. The 50% points for digits were obtained at a significantly lower signal-to-babble ratio than those for sentences and words, which were equivalent to each other. There were no interlist differences between the two lists for the digits and words, but there was a significant disparity between QuickSIN™ lists for the listeners with hearing loss. A two-item questionnaire was used to obtain a subjective measure of speech recognition, which showed moderate correlations with objective measures of speech recognition in noise using digits (r = .641), sentences (r = .572), and words (r = .673).


Author(s):  
Imad Qasim Habeeb ◽  
Tamara Z. Fadhil ◽  
Yaseen Naser Jurn ◽  
Zeyad Qasim Habeeb ◽  
Hanan Najm Abdulkhudhur

Automatic speech recognition (ASR) is a technology that allows computers and mobile devices to recognize and translate spoken language into text. ASR systems often produce poor accuracy for noisy speech signals. Therefore, this research proposes an ensemble technique that does not rely on a single filter for perfect noise reduction but instead incorporates information from multiple noise-reduction filters to improve the final ASR accuracy. The core of the technique is the generation of K copies of the speech signal using three noise-reduction filters. The speech features of these copies differ slightly, so the ASR system extracts slightly different texts from them, and the best of these texts can then be elected as the final ASR output. The ensemble technique was compared with three related current noise-reduction techniques in terms of character error rate (CER) and word error rate (WER). The test results were encouraging, showing relative decreases of 16.61% in CER and 11.54% in WER compared with the best current technique. The ASR field will benefit from this research's contribution to increasing the recognition accuracy of human speech in the presence of background noise.
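A minimal sketch of the ensemble idea follows: each noise-reduction filter produces a slightly different copy of the signal, each copy is recognized, and one transcript is elected. The filter list, the asr() callable, and the confidence-based election rule are hypothetical placeholders, since the abstract does not tie the technique to a specific recognizer or election criterion.

```python
def ensemble_asr(signal, filters, asr):
    # 'filters': list of noise-reduction functions (signal -> signal).
    # 'asr': hypothetical recognizer returning (text, confidence).
    candidates = [asr(f(signal)) for f in filters]
    # Elect the transcript with the highest recognizer confidence.
    best_text, _ = max(candidates, key=lambda tc: tc[1])
    return best_text
```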


Author(s):  
Ribwar Bakhtyar Ibrahim

Speech recognition has gained much attention from researchers for almost two decades. Isolated words, connected words, and continuous speech are the main focus areas of speech recognition. Researchers have adopted many techniques to address speech recognition challenges under the umbrellas of Artificial Intelligence (AI), pattern recognition, and acoustic-phonetic approaches. Variation in word pronunciation, individual accents, unwanted ambient noise, speech context, and the quality of input devices are some of these challenges. Many Application Programming Interfaces (APIs) have been developed to improve the accuracy of speech-to-text conversion, such as Microsoft Speech API and Google Speech API. In this paper, the performance of Microsoft Speech API is analyzed against other speech APIs mentioned in the literature on a specially prepared dataset (without background noise). A Voice Interactive Speech to Text (VIST) audio player was developed for the analysis of Microsoft Speech API. The VIST audio player creates runtime subtitles for the audio files it plays, performing speech-to-text conversion in real time. Microsoft Speech API was incorporated into the application so that its performance could be validated and measured. The experiments showed Microsoft Speech API to be more accurate than the other APIs in the context of the dataset prepared for the VIST audio player. According to precision-recall measures, the accuracy rate for Microsoft Speech API is 96%, which is better than those previously reported in the literature.
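Since the comparison is reported in precision-recall terms, the sketch below shows one plausible way to compute word-level precision and recall for a hypothesis transcript against a reference; the exact scoring procedure of the paper is not specified, and both example strings are invented.

```python
from collections import Counter

def word_precision_recall(reference, hypothesis):
    # Multiset intersection counts words matched between the transcripts.
    ref, hyp = Counter(reference.split()), Counter(hypothesis.split())
    matched = sum((ref & hyp).values())
    precision = matched / max(sum(hyp.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    return precision, recall

p, r = word_precision_recall("play the next track", "play the next rack")
print(f"precision={p:.2f} recall={r:.2f}")
```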

