Single Channel Speech Enhancement Using Adaptive Soft-Thresholding with Bivariate EMD

2013 ◽  
Vol 2013 ◽  
pp. 1-8 ◽  
Author(s):  
Md. Ekramul Hamid ◽  
Md. Khademul Islam Molla ◽  
Xin Dang ◽  
Takayoshi Nakai

This paper presents a novel data-adaptive thresholding approach to single channel speech enhancement. The noisy speech signal and fractional Gaussian noise (fGn) are combined to produce a complex signal. The fGn is generated using the noise variance roughly estimated from the noisy speech signal. Bivariate empirical mode decomposition (bEMD) is employed to decompose the complex signal into a finite number of complex-valued intrinsic mode functions (IMFs). The real and imaginary parts of the IMFs represent the IMFs of the observed speech and the fGn, respectively. Each IMF is divided into short time frames for local processing. The variance of the fGn IMF calculated within a frame is used as the reference to classify the corresponding noisy speech frame as noise dominant or signal dominant. Only the noise dominant frames are soft-thresholded to reduce the noise effects. Then all the frames as well as the IMFs of speech are combined, yielding the enhanced speech signal. The experimental results show the improved performance of the proposed algorithm compared to recently reported methods.
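The core per-IMF step, classifying each frame by comparing its variance against the matching fGn frame and soft-thresholding only the noise-dominant ones, can be sketched as follows. This is a minimal numpy sketch; the frame length, the variance test, and the universal-threshold form are illustrative assumptions, not the paper's exact settings, and the bEMD decomposition itself is assumed to have been done elsewhere.

```python
import numpy as np

def soft_threshold(x, t):
    # Shrink coefficients toward zero by t; values below t become zero.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def denoise_imf(speech_imf, fgn_imf, frame_len=256):
    """Frame-wise adaptive thresholding of one speech/fGn IMF pair.

    A frame is treated as noise-dominant when the variance of the
    noisy-speech frame does not exceed that of the matching fGn frame;
    only those frames are soft-thresholded.
    """
    out = speech_imf.copy()
    for start in range(0, len(speech_imf), frame_len):
        s = speech_imf[start:start + frame_len]
        g = fgn_imf[start:start + frame_len]
        if np.var(s) <= np.var(g):  # noise-dominant frame
            # Universal-style threshold from the fGn frame variance.
            t = np.sqrt(2.0 * np.var(g) * np.log(max(len(s), 2)))
            out[start:start + frame_len] = soft_threshold(s, t)
    return out

# Toy demo: one "speech" IMF whose second half is silence (noise only).
rng = np.random.default_rng(0)
n = 1024
tone = np.sin(2 * np.pi * 50 * np.arange(n) / 8000.0)
tone[n // 2:] = 0.0                       # speech absent here
noisy_imf = tone + 0.05 * rng.standard_normal(n)
fgn_imf = 0.05 * rng.standard_normal(n)   # reference fGn IMF
clean_est = denoise_imf(noisy_imf, fgn_imf)
```

Speech-dominant frames pass through untouched, so the thresholding only ever attenuates samples, never amplifies them.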

2011 ◽  
Vol 2011 ◽  
pp. 1-21 ◽  
Author(s):  
Md. Khademul Islam Molla ◽  
Poly Rani Ghosh ◽  
Keikichi Hirose

This paper presents a data-adaptive approach for the analysis of climate variability using bivariate empirical mode decomposition (BEMD). The time series of three climate factors, namely daily evaporation and the maximum and minimum temperatures, are considered in the variability analysis. All climate data are collected from a specific area of Bihar in India. Fractional Gaussian noise (fGn) is used here as the reference signal. The climate signal and fGn (of the same length) are combined to produce a bivariate (complex) signal, which is decomposed using BEMD into a finite number of sub-band signals called intrinsic mode functions (IMFs); the climate signal and the fGn are thus decomposed jointly. The instantaneous frequencies and Fourier spectra of the IMFs are examined to illustrate the properties of BEMD. The lowest frequency oscillation of the climate signal represents the annual cycle (AC), an important factor in analyzing climate change and variability. The energies of the fGn IMFs are used to define a data-adaptive threshold for separating the AC: the IMFs of the climate signal whose energies exceed this threshold are summed to obtain the AC. The interannual distance of the climate signal is also illustrated for a better understanding of climate change and variability.
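The energy-threshold step, keeping only the climate-signal IMFs whose energy exceeds an fGn-derived bound and summing them to recover the annual cycle, can be sketched as below. The simple multiplicative margin used for the threshold is an illustrative assumption; the paper derives its threshold from the fGn IMF energies in its own way.

```python
import numpy as np

def separate_annual_cycle(signal_imfs, fgn_imfs, margin=2.0):
    """Sum the climate-signal IMFs whose energy exceeds a data-adaptive
    threshold derived from the energies of the matching fGn IMFs.

    `margin` scales each fGn IMF energy to form the threshold; this
    exact threshold form is a placeholder, not the paper's.
    """
    sig_energy = np.array([np.sum(m ** 2) for m in signal_imfs])
    fgn_energy = np.array([np.sum(m ** 2) for m in fgn_imfs])
    keep = sig_energy > margin * fgn_energy
    ac = np.zeros_like(signal_imfs[0])
    for m, k in zip(signal_imfs, keep):
        if k:
            ac += m
    return ac, keep

# Toy demo: a slow "annual" oscillation plus two noise-like IMFs.
t = np.arange(3 * 365, dtype=float)
annual = 5.0 * np.sin(2 * np.pi * t / 365.0)
rng = np.random.default_rng(1)
imfs = [0.3 * rng.standard_normal(t.size),
        0.3 * rng.standard_normal(t.size),
        annual]
fgn = [0.3 * rng.standard_normal(t.size) for _ in range(3)]
ac, kept = separate_annual_cycle(imfs, fgn)
```

Only the high-energy annual component clears the threshold; the noise-like IMFs, whose energies match the fGn reference, are discarded.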


2021 ◽  
Author(s):  
Suparerk Janjarasjitt

Abstract Anticipating preterm birth is a crucial task that can reduce the rate of preterm birth and its complications. Electrohysterogram (EHG), or uterine electromyogram (EMG), data have been shown to provide information useful for anticipating preterm birth. Four distinct time-domain features commonly applied in EMG signal processing, i.e., mean absolute value, average amplitude change, difference absolute standard deviation value, and log detector, are investigated in this study. A single channel of EHG data is decomposed into its constituent components, i.e., intrinsic mode functions (IMFs), using empirical mode decomposition (EMD) before the time-domain features are extracted. The time-domain features of the IMFs of EHG data associated with preterm and term births are used for preterm-term birth classification with a support vector machine (SVM) with a radial basis function kernel. The classifications are validated using 10-fold cross-validation. The computational results show that excellent preterm-term birth classification can be achieved from a single channel of EHG data, and further suggest that the best overall performance is obtained when thirteen (out of sixteen) EMD-based time-domain features are used. The best accuracy, sensitivity, specificity, and F1-score achieved are 0.9382, 0.9130, 0.9634, and 0.9366, respectively.
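The four time-domain features named above have standard definitions in the EMG literature and are straightforward to compute per IMF; a minimal numpy sketch:

```python
import numpy as np

def mav(x):
    # Mean absolute value.
    return np.mean(np.abs(x))

def aac(x):
    # Average amplitude change: mean absolute first difference.
    return np.mean(np.abs(np.diff(x)))

def dasdv(x):
    # Difference absolute standard deviation value:
    # root-mean-square of the first difference.
    return np.sqrt(np.mean(np.diff(x) ** 2))

def log_detector(x, eps=1e-12):
    # Log detector: exp of the mean log-magnitude
    # (the geometric mean of |x|).
    return np.exp(np.mean(np.log(np.abs(x) + eps)))

# Tiny worked example on a 4-sample signal.
x = np.array([1.0, -2.0, 3.0, -4.0])
features = [mav(x), aac(x), dasdv(x), log_detector(x)]
```

In the study's pipeline each EHG channel would first be split into IMFs by EMD and these four features computed per IMF, giving the feature vector fed to the SVM.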


2013 ◽  
Vol 22 (2) ◽  
pp. 81-93
Author(s):  
Jaya Kumar Ashwini ◽  
Ramaswamy Kumaraswamy

Abstract This article presents an overview of single-channel dereverberation methods suitable for distant speech recognition (DSR) applications. The dereverberation methods are classified mainly by the domain in which the speech signal captured by a distant microphone is enhanced. Many single-channel speech enhancement methods focus on either denoising or dereverberating the distorted speech signal; very few consider both noise and reverberation effects. Such methods are discussed under a multistage approach in this article. The article concludes with the hypothesis that methods that do not require an a priori reverberation impulse response are preferable under varying environmental conditions for DSR applications such as intelligent home and office environments, humanoid robots, and automobiles.


Electronics ◽  
2020 ◽  
Vol 9 (7) ◽  
pp. 1125
Author(s):  
Haitao Lang ◽  
Jie Yang

Recently, supervised learning methods, especially deep neural network (DNN)-based methods, have shown promising performance in single-channel speech enhancement. Generally, those approaches extract the acoustic features directly from the noisy speech to train a magnitude-aware target. In this paper, we propose to extract the acoustic features not only from the noisy speech but also from the pre-estimated speech, noise and phase separately, and then fuse them into a new complementary feature in order to obtain a more discriminative acoustic representation. In addition, on the basis of learning a magnitude-aware target, we also utilize the fusion feature to learn a phase-aware target, thereby further improving the accuracy of the recovered speech. We conduct extensive experiments, including performance comparisons with typical existing methods, generalization ability evaluation on unseen noise, an ablation study, and subjective tests by human listeners, to demonstrate the feasibility and effectiveness of the proposed method. Experimental results show that the proposed method improves the quality and intelligibility of the reconstructed speech.
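The fusion idea, concatenating per-frame features drawn from the noisy spectrogram and from the pre-estimated speech, noise and phase, can be sketched as below. The specific feature set (log-power spectra plus a cos/sin phase encoding) is an illustrative assumption; the paper defines its own acoustic features and pre-estimators.

```python
import numpy as np

def log_power_features(mag, eps=1e-8):
    # Log-power spectral features from a magnitude spectrogram.
    return np.log(mag ** 2 + eps)

def fuse_features(noisy_mag, est_speech_mag, est_noise_mag, est_phase):
    """Concatenate per-frame features from the noisy spectrogram and the
    pre-estimated speech, noise and phase into one fused feature vector.
    """
    parts = [log_power_features(noisy_mag),
             log_power_features(est_speech_mag),
             log_power_features(est_noise_mag),
             np.cos(est_phase),            # phase encoded as cos/sin
             np.sin(est_phase)]
    return np.concatenate(parts, axis=-1)  # shape: (frames, 5 * bins)

# Toy shapes: 10 frames, 129 frequency bins.
frames, bins_ = 10, 129
rng = np.random.default_rng(2)
noisy = np.abs(rng.standard_normal((frames, bins_)))
speech = np.abs(rng.standard_normal((frames, bins_)))
noise = np.abs(rng.standard_normal((frames, bins_)))
phase = rng.uniform(-np.pi, np.pi, (frames, bins_))
fused = fuse_features(noisy, speech, noise, phase)
```

The fused matrix would then be the input to the DNN that learns the magnitude- and phase-aware targets.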


Author(s):  
Poovarasan Selvaraj ◽  
E. Chandra

In Speech Enhancement (SE) techniques, the major challenge is to suppress non-stationary noises, including white noise, in real-time application scenarios. Many techniques have been developed for enhancing vocal signals; however, they were not very effective at suppressing non-stationary noises and had high time and resource consumption. As a result, a Sliding Window Empirical Mode Decomposition and Hurst (SWEMDH)-based SE method was developed, in which the speech signal is decomposed into Intrinsic Mode Functions (IMFs) over a sliding window and the noise factor in each IMF is determined from the Hurst exponent; the least corrupted IMFs are then used to restore the vocal signal. However, that technique was not suitable for white noise scenarios. Therefore, in this paper, a Variant of Variational Mode Decomposition (VVMD) combined with the SWEMDH technique is proposed to reduce the complexity in real-time applications. The key objective of the proposed SWEMDH-VVMD technique is to select the IMFs based on the Hurst exponent and then apply the VVMD technique to suppress both low- and high-frequency noise factors in the vocal signals. First, the noisy vocal signal is decomposed into many IMFs using the SWEMDH technique. Then, the Hurst exponent is computed to identify the IMFs with low-frequency noise factors, and Narrow-Band Components (NBC) are computed to identify the IMFs with high-frequency noise factors. Moreover, VVMD is applied to the sum of all chosen IMFs to remove both low- and high-frequency noise factors. Thus, the speech signal quality is improved under non-stationary noises, including additive white Gaussian noise. Finally, the experimental outcomes demonstrate significant speech signal improvement under both non-stationary and white noise conditions.
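The Hurst-exponent screening of IMFs rests on a standard estimator; one common quick form fits the scaling law std(x[t+lag] - x[t]) ~ lag**H on a log-log plot. The sketch below is that generic estimator, not the paper's exact procedure, and the lag range is an illustrative choice.

```python
import numpy as np

def hurst_exponent(x, max_lag=20):
    """Estimate the Hurst exponent from the scaling of the standard
    deviation of lagged differences: std(x[t+lag] - x[t]) ~ lag**H.
    Treats x as an fBm-like path; H near 0.5 indicates an
    uncorrelated (noise-like) component.
    """
    lags = np.arange(2, max_lag)
    tau = [np.std(x[lag:] - x[:-lag]) for lag in lags]
    slope, _ = np.polyfit(np.log(lags), np.log(tau), 1)
    return slope

# Toy demo: a Brownian path (cumulative sum of white noise) has H ~ 0.5.
rng = np.random.default_rng(3)
bm = np.cumsum(rng.standard_normal(4096))
h = hurst_exponent(bm)
```

In a SWEMDH-style pipeline, each IMF's estimated H would decide whether it is treated as a noise factor or passed on for reconstruction.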


Author(s):  
Shifeng Ou ◽  
Peng Song ◽  
Ying Gao

The a priori signal-to-noise ratio (SNR) plays an essential role in many speech enhancement systems. Most existing approaches to estimating the a priori SNR exploit only the amplitude spectra while neglecting the phase. Considering that incorporating phase information into a speech processing system can significantly improve speech quality, this paper proposes a phase-sensitive decision-directed (DD) approach for a priori SNR estimation. By representing the short-time discrete Fourier transform (STFT) signal spectra geometrically in the complex plane, the proposed approach estimates the a priori SNR using both the magnitude and phase information while making no assumptions about the phase difference between the clean speech and noise spectra. Objective evaluations in terms of spectrograms, segmental SNR, log-spectral distance (LSD) and short-time objective intelligibility (STOI) measures are presented to demonstrate the superiority of the proposed approach over several competitive methods under different noise conditions and input SNR levels.
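For context, the magnitude-only DD recursion that the paper's phase-sensitive variant extends is the classic Ephraim-Malah form, xi(l) = alpha * |S_hat(l-1)|^2 / lambda + (1 - alpha) * max(gamma(l) - 1, 0). A minimal sketch of that baseline (the Wiener gain and gain floor used for the amplitude estimate are common but illustrative choices):

```python
import numpy as np

def dd_a_priori_snr(noisy_mag, noise_psd, alpha=0.98, gain_floor=0.1):
    """Classic decision-directed a priori SNR estimate, per frequency
    bin over frames, using a Wiener gain for the amplitude estimate.
    This is the magnitude-only baseline, not the paper's
    phase-sensitive extension.
    """
    n_frames, n_bins = noisy_mag.shape
    xi = np.zeros((n_frames, n_bins))
    prev_s2 = np.zeros(n_bins)                  # |S_hat(l-1)|^2
    for l in range(n_frames):
        gamma = noisy_mag[l] ** 2 / noise_psd   # a posteriori SNR
        xi[l] = (alpha * prev_s2 / noise_psd
                 + (1 - alpha) * np.maximum(gamma - 1.0, 0.0))
        gain = np.maximum(xi[l] / (1.0 + xi[l]), gain_floor)
        prev_s2 = (gain * noisy_mag[l]) ** 2
    return xi

# Toy demo: stationary input, 4 bins, noise PSD of 1, |Y| = 2.
noisy_mag = np.full((50, 4), 2.0)
noise_psd = np.ones(4)
xi = dd_a_priori_snr(noisy_mag, noise_psd)
```

With alpha close to 1 the estimate starts small and ramps up smoothly toward its steady state, which is exactly the smoothing behavior that makes DD estimates robust to musical noise.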


2020 ◽  
Vol 39 (5) ◽  
pp. 6881-6889
Author(s):  
Jie Wang ◽  
Linhuang Yan ◽  
Jiayi Tian ◽  
Minmin Yuan

In this paper, a bilateral spectrogram filtering (BSF)-based optimally modified log-spectral amplitude (OMLSA) estimator for single-channel speech enhancement is proposed, which can significantly improve the performance of OMLSA, especially in highly non-stationary noise environments, by taking advantage of bilateral filtering (BF), a technology widely used in image and visual processing, to preprocess the spectrogram of the noisy speech. When a speech spectrogram is treated as an image, BSF is capable not only of sharpening details and removing unwanted textures or background noise, but also of preserving edges. The a posteriori signal-to-noise ratio (SNR) of the OMLSA algorithm is estimated after applying BSF to the noisy speech. Besides, in order to reduce computing costs, a fast and accurate BF is adopted to reduce the algorithm complexity to O(1) per time-frequency bin. Finally, the proposed algorithm is compared with the original OMLSA and other classic denoising methods using various types of noise at different signal-to-noise ratios in terms of objective evaluation metrics such as segmental signal-to-noise ratio improvement and perceptual evaluation of speech quality. The results show the validity of the improved BSF-based OMLSA algorithm.
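The edge-preserving smoothing that BSF applies to a spectrogram is the standard bilateral filter: each bin is replaced by a mean of its neighbors weighted jointly by spatial distance and intensity difference. The brute-force sketch below illustrates the principle; the paper uses a fast O(1)-per-bin approximation instead, and the kernel parameters here are illustrative.

```python
import numpy as np

def bilateral_filter(img, radius=2, sigma_s=1.5, sigma_r=0.5):
    """Brute-force bilateral filter for a 2-D array (e.g. a spectrogram):
    each output value is a weighted mean of its neighborhood, with
    weights falling off in both distance and intensity difference.
    """
    pad = np.pad(img, radius, mode='edge')
    out = np.zeros_like(img)
    # Precompute the spatial Gaussian kernel once.
    ax = np.arange(-radius, radius + 1)
    dx, dy = np.meshgrid(ax, ax)
    spatial = np.exp(-(dx ** 2 + dy ** 2) / (2 * sigma_s ** 2))
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            patch = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            range_w = np.exp(-(patch - img[i, j]) ** 2 / (2 * sigma_r ** 2))
            weights = spatial * range_w
            out[i, j] = np.sum(weights * patch) / np.sum(weights)
    return out

# Toy demo: a noisy step edge; flats are smoothed, the edge survives.
rng = np.random.default_rng(4)
step = np.concatenate([np.zeros((8, 8)), np.ones((8, 8))], axis=1)
noisy = step + 0.1 * rng.standard_normal(step.shape)
smoothed = bilateral_filter(noisy)
```

The range term is what keeps spectral edges (speech onsets, harmonic tracks) intact while the spatial term averages away texture-like noise, which is exactly the behavior exploited before the OMLSA a posteriori SNR estimate.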


2014 ◽  
Vol 2014 ◽  
pp. 1-11 ◽  
Author(s):  
Erhan Deger ◽  
Md. Khademul Islam Molla ◽  
Keikichi Hirose ◽  
Nobuaki Minematsu ◽  
Md. Kamrul Hasan

This paper presents a two-stage soft thresholding algorithm based on the discrete cosine transform (DCT) and empirical mode decomposition (EMD). In the first stage, the noisy speech is decomposed into eight frequency bands and a specific noise variance is calculated for each one. Based on this variance, each band is denoised using soft thresholding in the DCT domain. The remaining noise is eliminated in the second stage through a time domain soft thresholding strategy adapted to the intrinsic mode functions (IMFs) derived by applying EMD to the signal obtained from the first stage. Significantly better SNR improvement and perceptual speech quality results for different noise types demonstrate the superiority of the proposed algorithm over recently reported techniques.
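The first stage, band-wise soft thresholding in the DCT domain, can be sketched as follows. The per-band noise estimate here (median absolute coefficient divided by 0.6745, a common robust proxy) and the equal band split are illustrative assumptions; the paper computes its own per-band noise variances.

```python
import numpy as np
from scipy.fft import dct, idct

def soft(x, t):
    # Soft thresholding: shrink toward zero by t.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def dct_band_threshold(frame, n_bands=8, k=1.0):
    """Split a frame's DCT coefficients into equal frequency bands,
    estimate a noise level per band, and soft-threshold each band.
    """
    c = dct(frame, norm='ortho')
    edges = np.linspace(0, len(c), n_bands + 1, dtype=int)
    for b in range(n_bands):
        band = c[edges[b]:edges[b + 1]]
        # Robust sigma estimate from the median absolute coefficient.
        sigma = np.median(np.abs(band)) / 0.6745
        c[edges[b]:edges[b + 1]] = soft(band, k * sigma)
    return idct(c, norm='ortho')

# Toy demo: a sinusoid in white noise.
rng = np.random.default_rng(5)
clean = np.sin(2 * np.pi * 5 * np.arange(512) / 512.0)
noisy = clean + 0.2 * rng.standard_normal(512)
denoised = dct_band_threshold(noisy)
```

Because the tone's energy concentrates in a few large DCT coefficients, the threshold removes most of the spread-out noise energy while barely biasing the signal; the second EMD stage would then clean up what remains in the time domain.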


Author(s):  
Xianyun Wang ◽  
Changchun Bao

Abstract According to the encoding and decoding mechanism of binaural cue coding (BCC), in this paper the speech and noise are considered the left-channel and right-channel signals of the BCC framework, respectively. The speech signal is then estimated from the noisy speech when the inter-channel level difference (ICLD) and inter-channel correlation (ICC) between speech and noise are given. In this paper, exact inter-channel cues and pre-enhanced inter-channel cues are used for speech restoration. The exact inter-channel cues are extracted from the clean speech and noise, and the pre-enhanced inter-channel cues are extracted from the pre-enhanced speech and estimated noise. After that, they are paired one by one to form a codebook. Once the pre-enhanced cues are extracted from the noisy speech, the exact cues are estimated via a mapping between the pre-enhanced cues and the prior codebook. Next, the estimated exact cues are used to obtain a time-frequency (T-F) mask for enhancing the noisy speech based on BCC decoding. In addition, in order to further improve the accuracy of the T-F mask based on the inter-channel cues, a deep neural network (DNN)-based method is proposed to learn the mapping between input features of the noisy speech and the T-F masks. Experimental results show that the codebook-driven method achieves better performance than conventional methods, and the DNN-based method performs better than the codebook-driven method.
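The link from the ICLD cue to a T-F mask can be illustrated as below: with speech and noise as the two BCC "channels", the ICLD in dB encodes the per-bin speech-to-noise power ratio, from which a Wiener-like mask follows directly. The Wiener form of the mask is an illustrative assumption; the paper's decoding uses its own mask construction from both ICLD and ICC.

```python
import numpy as np

def icld(speech_pow, noise_pow, eps=1e-12):
    # Inter-channel level difference in dB between the speech
    # and noise "channels" of the BCC framework.
    return 10.0 * np.log10((speech_pow + eps) / (noise_pow + eps))

def mask_from_icld(icld_db):
    """Wiener-like T-F mask recovered from the ICLD alone:
    with r = speech/noise power ratio, mask = r / (1 + r).
    """
    r = 10.0 ** (icld_db / 10.0)
    return r / (1.0 + r)

# Toy demo: bins where speech dominates get a mask near 1,
# noise-dominated bins a mask near 0.
speech_pow = np.array([10.0, 1.0, 0.1])
noise_pow = np.array([0.1, 1.0, 10.0])
m = mask_from_icld(icld(speech_pow, noise_pow))
```

The codebook-driven and DNN-based methods in the paper differ in how they estimate the exact cues from noisy observations; once estimated, applying the resulting mask to the noisy spectrogram is this simple per-bin operation.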

