Integrating Dilated Convolution into DenseLSTM for Audio Source Separation

2021 ◽  
Vol 11 (2) ◽  
pp. 789
Author(s):  
Woon-Haeng Heo ◽  
Hyemi Kim ◽  
Oh-Wook Kwon

Herein, we propose a multi-scale multi-band dilated time-frequency densely connected convolutional network (DenseNet) with long short-term memory (LSTM) for audio source separation. Because the spectrogram of an acoustic signal can be treated both as an image and as time-series data, it is well suited to a convolutional recurrent neural network (CRNN) architecture. We improve audio source separation performance by adding a dilated block, built on dilated convolutions, to the CRNN architecture. The dilated block effectively enlarges the receptive field over the spectrogram. It is also designed to account for the acoustic characteristic that the frequency and time axes of a spectrogram are affected by independent factors, such as speech rate and pitch. In speech enhancement experiments, we estimated the speech signal from a mixture of music, noise, and speech using various deep learning architectures. We conducted a subjective evaluation of the estimated speech signal and also measured speech quality, intelligibility, separation, and speech recognition performance. In music signal separation, we estimated the music signal from a mixture of music and speech using several deep learning architectures, and then measured separation performance and music identification accuracy on the estimated music signal. Overall, the proposed architecture achieves the best performance among the compared deep learning architectures in both the speech and music experiments.
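The core idea of the dilated block, widening the receptive field by sampling the input at spaced intervals, can be illustrated with a minimal numpy sketch (a single 1-D dilated layer, not the paper's multi-scale multi-band architecture; the function name is illustrative):

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Valid-mode 1-D dilated convolution (deep-learning convention,
    i.e. cross-correlation, no kernel flip).

    A kernel of length K with dilation d covers a receptive field of
    (K - 1) * d + 1 input samples, so stacked dilated layers widen the
    context seen along the time or frequency axis of a spectrogram
    without adding parameters.
    """
    K = len(kernel)
    span = (K - 1) * dilation + 1          # receptive field of one layer
    out_len = len(x) - span + 1
    out = np.empty(out_len)
    for i in range(out_len):
        out[i] = sum(kernel[k] * x[i + k * dilation] for k in range(K))
    return out

x = np.arange(16, dtype=float)
# each output taps inputs i, i+4, i+8 -> receptive field of 9 samples
y = dilated_conv1d(x, np.array([1.0, 1.0, 1.0]), dilation=4)
```

With dilation 1 this reduces to an ordinary convolution; raising the dilation multiplies the receptive field while the kernel stays three taps wide.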

2021 ◽  
Author(s):  
Jiawen Chua

<p>In most real-time systems, particularly for applications involving system identification, latency is a critical issue. These applications include, but are not limited to, blind source separation (BSS), beamforming, speech dereverberation, acoustic echo cancellation and channel equalization. The system latency consists of an algorithmic delay and an estimation computational time. The latter can be avoided by using a multi-threaded system, which runs the estimation process and the processing procedure simultaneously. The former, which consists of a delay of one window length, is usually unavoidable for frequency-domain approaches, in which a block of data is acquired by using a window, transformed and processed in the frequency domain, and recovered back to the time domain by using an overlap-add technique.  In the frequency domain, the convolutive model, which is usually used to describe the process of a linear time-invariant (LTI) system, can be represented by a series of multiplicative models to facilitate estimation. To implement frequency-domain approaches in real-time applications, the short-time Fourier transform (STFT) is commonly used. The window used in the STFT must be at least twice as long as the room impulse response, which is itself long, so that the multiplicative model is sufficiently accurate. The delay constraint caused by the associated blockwise processing window length makes most frequency-domain approaches inapplicable to real-time systems.  This thesis aims to design a BSS system that can be used in a real-time scenario with minimal latency. Existing BSS approaches can be integrated into our system to perform source separation with low delay without affecting the separation performance. The second goal is to design a BSS system that can perform source separation in a non-stationary environment.  
We first introduce a subspace approach that directly estimates the separation parameters in the low-frequency-resolution time-frequency (LFRTF) domain. In the LFRTF domain, a shorter window is used to reduce the algorithmic delay of the system during signal acquisition, i.e., the window length is shorter than the room impulse response. The subspace method facilitates the deconvolution of a convolutive mixture into a new instantaneous mixture and simplifies the estimation process.  Second, we propose an alternative approach to the algorithmic latency problem. This method obtains the separation parameters in the LFRTF domain from parameters estimated in the high-frequency-resolution time-frequency (HFRTF) domain, where the window length is longer than the room impulse response, without affecting the separation performance.  The thesis also provides a solution to the BSS problem in a non-stationary environment. We utilize the "meta-information" obtained from previous BSS operations to facilitate separation in the future without repeating the entire BSS process, which can be computationally expensive. Most conventional BSS algorithms require sufficient signal samples to perform analysis, and this prolongs the estimation delay. By utilizing information from the entire spectrum, our method can update the separation parameters with only a single snapshot of observation data. Hence, our method minimizes the estimation period, reduces redundancy and improves the efficacy of the system.  The final contribution of the thesis is a non-iterative method for impulse response shortening, which allows a shorter representation to approximate the long impulse response. It further improves the computational efficiency of the algorithm while achieving satisfactory performance.</p>
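The one-window algorithmic delay discussed above comes from the blockwise STFT/overlap-add pipeline: no output sample can be emitted until a full window has been buffered. A minimal numpy sketch of that pipeline (windowing, a processing hook, and weighted overlap-add resynthesis; the function name and parameters are illustrative, not from the thesis) makes the mechanism concrete:

```python
import numpy as np

def overlap_add_identity(x, win_len, hop):
    """Split a signal into overlapping windowed frames and resynthesise
    it by weighted overlap-add. A frequency-domain separator would
    transform and process each frame between analysis and synthesis;
    here the frame passes through unchanged, so interior samples are
    recovered exactly. The first output still requires `win_len`
    buffered samples -- the algorithmic delay the thesis targets.
    """
    win = np.hanning(win_len + 1)[:-1]          # periodic Hann window
    n_frames = (len(x) - win_len) // hop + 1
    y = np.zeros(len(x))
    norm = np.zeros(len(x))
    for f in range(n_frames):
        start = f * hop
        frame = x[start:start + win_len] * win  # analysis
        # (frequency-domain processing of `frame` would happen here)
        y[start:start + win_len] += frame * win # synthesis + overlap-add
        norm[start:start + win_len] += win ** 2
    valid = norm > 1e-8                         # normalise where covered
    y[valid] /= norm[valid]
    return y

x = np.random.default_rng(1).standard_normal(1024)
y = overlap_add_identity(x, win_len=64, hop=32)
# interior samples reconstruct exactly; shrinking win_len shrinks the
# delay but, per the thesis, breaks the multiplicative mixing model
```

Halving `win_len` halves the delay, which is exactly why the thesis works in a low-frequency-resolution (short-window) domain and recovers accuracy by other means.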


2010 ◽  
pp. 246-265 ◽  
Author(s):  
Andrew Nesbit ◽  
Maria G. Jafari ◽  
Emmanuel Vincent ◽  
Mark D. Plumbley

The authors address the problem of audio source separation, namely, the recovery of audio signals from recordings of mixtures of those signals. The sparse component analysis framework is a powerful method for achieving this. Sparse orthogonal transforms, in which only a few transform coefficients differ significantly from zero, are developed; once the signal has been transformed, energy is apportioned from each transform coefficient to each estimated source, and, finally, the signal is reconstructed using the inverse transform. The overriding aim of this chapter is to demonstrate how this framework, as exemplified here by two different decomposition methods which adapt to the signal to represent it sparsely, can be used to solve different problems in different mixing scenarios. To address the instantaneous (neither delays nor echoes) and underdetermined (more sources than mixtures) mixing model, a lapped orthogonal transform is adapted to the signal by selecting a basis from a library of predetermined bases. This method is closely related to the windowing methods used in the MPEG audio coding framework. In the anechoic (delays but no echoes) and determined (equal number of sources and mixtures) mixing case, a greedy adaptive transform is used based on orthogonal basis functions that are learned from the observed data, instead of being selected from a predetermined library of bases. This approach is found to encode the signal characteristics by introducing a feedback system between the bases and the observed data. Experiments on mixtures of speech and music signals demonstrate that these methods give good signal approximations and separation performance, and indicate promising directions for future research.
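The transform-apportion-invert pipeline described above can be sketched in a few lines of numpy. This toy version uses the FFT as the sparsifying transform and winner-take-all assignment of each coefficient to the best-matching mixing direction (the chapter instead adapts lapped orthogonal or learned bases; the mixing matrix, signals and function name here are illustrative):

```python
import numpy as np

def sca_separate(mixtures, A):
    """Sparse-component-analysis separation of an instantaneous,
    underdetermined mixture: transform the mixtures, apportion each
    transform coefficient entirely to one source, inverse-transform.
    Assumes sources are (nearly) disjoint in the transform domain.
    """
    n_mix, n_samp = mixtures.shape
    n_src = A.shape[1]
    norms = np.linalg.norm(A, axis=0)
    cols = A / norms                          # unit mixing directions
    C = np.fft.rfft(mixtures, axis=1)         # sparsifying transform
    S = np.zeros((n_src, C.shape[1]), dtype=complex)
    for k in range(C.shape[1]):
        c = C[:, k]
        j = np.argmax(np.abs(cols.T @ c))     # winner-take-all apportionment
        S[j, k] = (cols[:, j] @ c) / norms[j] # coefficient of source j
    return np.fft.irfft(S, n=n_samp, axis=1)

# 3 sources, 2 mixtures: underdetermined instantaneous mixing
t = np.arange(512) / 512.0
src = np.stack([np.sin(2 * np.pi * f * t) for f in (5, 40, 90)])
A = np.array([[1.0, 0.7, 0.2],
              [0.2, 0.7, 1.0]])
est = sca_separate(A @ src, A)
```

Because the three sinusoids occupy disjoint frequency bins, each coefficient belongs to exactly one source and the binary apportionment recovers all three signals from only two mixtures.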


Sensors ◽  
2018 ◽  
Vol 18 (5) ◽  
pp. 1371 ◽  
Author(s):  
Wai Lok Woo ◽  
Bin Gao ◽  
Ahmed Bouridane ◽  
Bingo Wing-Kuen Ling ◽  
Cheng Siong Chin

This paper presents an unsupervised learning algorithm for sparse nonnegative matrix factor time–frequency deconvolution with optimized fractional β-divergence. The β-divergence is a family of cost functions parametrized by a single parameter β. The Itakura–Saito divergence, Kullback–Leibler divergence and least-squares distance are special cases corresponding to β=0, 1, 2, respectively. This paper presents a generalized algorithm that uses a flexible range of β, including fractional values. It describes a majorization–minimization (MM) scheme leading to a fast multiplicative update algorithm with guaranteed convergence. The proposed model operates in the time–frequency domain and decomposes an information-bearing matrix into a two-dimensional deconvolution of factor matrices that represent the spectral dictionary and temporal codes. The deconvolution process has been optimized to yield sparse temporal codes by maximizing the likelihood of the observations. The paper also presents a method to estimate the fractional β value. The method is demonstrated on separating audio mixtures recorded from a single channel. The paper shows that extraction of the spectral dictionary and temporal codes is significantly more efficient with the proposed algorithm, which subsequently leads to better source separation performance. Experimental tests and comparisons with other factorization methods have been conducted to verify its efficacy.
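The β-divergence family and its three classical special cases can be written down directly. The following numpy sketch (illustrative, not the paper's optimized algorithm) evaluates d_β for any real β, including the fractional values the paper advocates; note that β=2 yields half the squared Euclidean distance:

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Sum of the elementwise beta-divergence d_beta(x | y).

    beta = 0 -> Itakura-Saito, beta = 1 -> Kullback-Leibler,
    beta = 2 -> half the squared Euclidean distance; fractional beta
    values interpolate between these cost functions.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if beta == 0:                      # Itakura-Saito (scale-invariant)
        d = x / y - np.log(x / y) - 1
    elif beta == 1:                    # generalized Kullback-Leibler
        d = x * np.log(x / y) - x + y
    else:                              # generic (fractional) beta
        d = (x**beta + (beta - 1) * y**beta
             - beta * x * y**(beta - 1)) / (beta * (beta - 1))
    return d.sum()
```

Every member is nonnegative and vanishes only when x = y, which is what makes any fixed or estimated fractional β a valid measure of fit for the factorization.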


2021 ◽  
Vol 17 (2) ◽  
pp. e1008698
Author(s):  
Tzu-Hao Lin ◽  
Tomonari Akamatsu ◽  
Yu Tsao

Remote acquisition of information on ecosystem dynamics is essential for conservation management, especially in the deep ocean. Soundscapes offer unique opportunities to study the behavior of soniferous marine animals and their interactions with various noise-generating activities at a fine temporal resolution. However, the retrieval of soundscape information remains challenging owing to the limitations of audio analysis techniques in the face of highly variable interfering sources. This study investigated the application of a seafloor acoustic observatory as a long-term platform for observing marine ecosystem dynamics through audio source separation. A source separation model based on the assumption of source-specific periodicity was used to factorize time-frequency representations of long-duration underwater recordings. With minimal supervision, the model learned to discriminate source-specific spectral features and proved effective in separating sounds made by cetaceans, soniferous fish, and abiotic sources in deep-water soundscapes off northeastern Taiwan. The results revealed phenological differences among the sound sources and identified diurnal and seasonal interactions between cetaceans and soniferous fish. Applying clustering to the source separation results generated a database capturing the diversity of soundscapes and revealed a compositional shift in clusters of cetacean vocalizations and fish choruses over diurnal and seasonal cycles. The source separation model transforms single-channel audio into multiple channels encoding the dynamics of biophony, geophony, and anthropophony, which are essential for characterizing the community of soniferous animals, the quality of the acoustic habitat, and their interactions. Our results demonstrate that source separation can facilitate acoustic diversity assessment, a crucial task in soundscape-based ecosystem monitoring. Future implementation of soundscape information retrieval in long-term marine observation networks will enable soundscapes to serve as a new tool for conservation management in an increasingly noisy ocean.
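The key modeling assumption, that each sound source repeats with a characteristic period, can be illustrated with a simple autocorrelation-based score (a simplified stand-in for the study's periodicity-coded factorization; the signals and function name here are illustrative):

```python
import numpy as np

def periodicity_score(activation, period):
    """Score how strongly a component's temporal activation repeats at
    a given lag, via the normalised autocorrelation at that lag.
    Values near 1 indicate a source pulsing at that period; values near
    0 indicate aperiodic activity. Grouping spectral components by such
    source-specific periods is the separation criterion assumed here.
    """
    a = np.asarray(activation, dtype=float)
    a = a - a.mean()
    num = np.dot(a[:-period], a[period:])
    den = np.dot(a, a)
    return num / den if den > 0 else 0.0

t = np.arange(2000)
# a chorus-like activation repeating every 100 frames vs. broadband noise
fish_chorus = np.maximum(0.0, np.sin(2 * np.pi * t / 100))
noise = np.random.default_rng(0).random(2000)
```

A component scoring high at a chorus period can be routed to the "biophony" channel, while components with no dominant period fall to the geophony/anthropophony channels.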


2008 ◽  
Vol 123 (5) ◽  
pp. 3885-3885
Author(s):  
Mingu Lee ◽  
Inseok Heo ◽  
Nakjin Choi ◽  
Koeng‐Mo Sung

Author(s):  
Bryan Lim ◽  
Stefan Zohren

Numerous deep learning architectures have been developed to accommodate the diversity of time-series datasets across different domains. In this article, we survey common encoder and decoder designs used in both one-step-ahead and multi-horizon time-series forecasting—describing how temporal information is incorporated into predictions by each model. Next, we highlight recent developments in hybrid deep learning models, which combine well-studied statistical models with neural network components to improve pure methods in either category. Lastly, we outline some ways in which deep learning can also facilitate decision support with time-series data. This article is part of the theme issue ‘Machine learning for weather and climate modelling’.
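As a concrete, deliberately simple instance of the one-step-ahead setup surveyed above, the sketch below fits a linear lag-window forecaster with numpy: the "encoder" is just the lag vector and the "decoder" a fitted linear map, both of which neural forecasters replace with learned networks (all names and parameters are illustrative):

```python
import numpy as np

def fit_ar_forecaster(series, window):
    """One-step-ahead forecasting in encoder/decoder terms: encode the
    last `window` observations as a feature vector (the lags), decode
    it to the next value with a linear map fitted by least squares.
    Returns a predict(history) closure.
    """
    # sliding lag windows as rows, next value as target
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    design = np.column_stack([X, np.ones(len(X))])     # add a bias term
    w, *_ = np.linalg.lstsq(design, y, rcond=None)
    def predict(history):
        return float(np.dot(w[:-1], history[-window:]) + w[-1])
    return predict

t = np.arange(200, dtype=float)
series = np.sin(0.3 * t)                    # a signal with exact AR structure
predict = fit_ar_forecaster(series[:150], window=8)
pred = predict(series[:150])                # forecast of series[150]
```

Multi-horizon forecasting iterates or directly regresses several steps ahead; hybrid models in the survey's sense would keep this statistical backbone and let a network model the residual structure.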


2010 ◽  
pp. 266-296 ◽  
Author(s):  
Cédric Févotte

Nonnegative matrix factorization (NMF) is a popular linear regression technique in the fields of machine learning and signal/image processing. Much research on this topic has been driven by applications in audio. NMF has, for example, been successfully applied to automatic music transcription and audio source separation, where the data is usually taken as the magnitude spectrogram of the sound signal, and the Euclidean distance or Kullback-Leibler divergence is used as the measure of fit between the original spectrogram and its approximate factorization. In this chapter the authors give evidence of the relevance of considering factorization of the power spectrogram with the Itakura-Saito (IS) divergence. Indeed, IS-NMF is shown to be connected to maximum likelihood inference of variance parameters in a well-defined statistical model of superimposed Gaussian components, and this model is in turn shown to be well suited to audio. Furthermore, the statistical setting opens doors to Bayesian approaches and to a variety of computational inference techniques. The authors discuss in particular model order selection strategies and Markov regularization of the activation matrix to account for time persistence in audio. The chapter also discusses extensions of NMF to the multichannel case, for both instantaneous and convolutive recordings, possibly underdetermined, and presents audio source separation results on a real stereo musical excerpt.
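The core IS-NMF fit can be sketched with the standard multiplicative update rules for the β=0 divergence (a minimal numpy sketch, not the chapter's full statistical treatment; initialization, iteration count and function name are illustrative):

```python
import numpy as np

def is_nmf(V, rank, n_iter=300, seed=0):
    """Itakura-Saito NMF by multiplicative updates: V ~ W @ H, where V
    is a (nonnegative) power spectrogram, W a spectral dictionary and
    H the temporal activations. In the chapter's statistical reading,
    W @ H models the variances of superimposed Gaussian components.
    """
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, rank)) + 1e-3       # positive initialization
    H = rng.random((rank, N)) + 1e-3
    for _ in range(n_iter):
        WH = W @ H
        H *= (W.T @ (V / WH**2)) / (W.T @ (1.0 / WH))
        WH = W @ H
        W *= ((V / WH**2) @ H.T) / ((1.0 / WH) @ H.T)
    return W, H

# fit an exactly rank-2 synthetic "power spectrogram"
rng = np.random.default_rng(3)
V = (rng.random((6, 2)) + 0.1) @ (rng.random((2, 8)) + 0.1)
W, H = is_nmf(V, rank=2)
```

Because the IS divergence is scale-invariant, low-energy time-frequency regions weigh as much as high-energy ones, which is the property that makes the power-spectrogram formulation attractive for audio.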

