Unsupervised Learning of Total Variability Embedding for Speaker Verification with Random Digit Strings

Woo Hyun Kang; Nam Soo Kim

doi:10.3390/app9081597

Unsupervised Learning of Total Variability Embedding for Speaker Verification with Random Digit Strings

Applied Sciences ◽

10.3390/app9081597 ◽

2019 ◽

Vol 9 (8) ◽

pp. 1597 ◽

Cited By ~ 3

Author(s):

Woo Hyun Kang ◽

Nam Soo Kim

Keyword(s):

Short Duration ◽

Latent Variable ◽

Speaker Verification ◽

Gaussian Mixture ◽

Training Data ◽

Optimal Method ◽

Vector Method ◽

Speaker Variability ◽

Total Variability ◽

Increasing Demand

Recently, the increasing demand for voice-based authentication systems has encouraged researchers to investigate methods for verifying users with short randomized pass-phrases drawn from a constrained vocabulary. The conventional i-vector framework, which has proven to be a state-of-the-art utterance-level feature extraction technique for speaker verification, is not considered optimal for this task because it is known to suffer severe performance degradation on short-duration speech utterances. More recent approaches that use deep-learning techniques to embed the speaker variability in a non-linear fashion have shown impressive performance in various speaker verification tasks. However, since most of these techniques are trained in a supervised manner, which requires speaker labels for the training data, they are difficult to use when only a small amount of labeled data is available for training. In this paper, we propose a novel technique for extracting an i-vector-like feature based on the variational autoencoder (VAE), which is trained in an unsupervised manner to obtain a latent variable representing the variability within a Gaussian mixture model (GMM) distribution. The proposed framework is compared with the conventional i-vector method using the TIDIGITS dataset. Experimental results showed that the proposed method could cope with the performance deterioration caused by the short duration. Furthermore, the performance of the proposed approach improved significantly when applied in conjunction with the conventional i-vector framework.
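
As a rough illustration of the latent-variable machinery a VAE relies on, the sketch below shows the two generic ingredients of VAE training: reparameterized sampling of the latent variable and the closed-form KL term against a standard normal prior. This is a minimal sketch under stated assumptions, not the authors' model: the diagonal Gaussian posterior, the standard normal prior, and the names `sample_latent` and `kl_to_standard_normal` are hypothetical choices made for illustration.

```python
import numpy as np


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization)."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)


# Assumed toy posterior parameters for a 3-dimensional latent variable.
rng = np.random.default_rng(0)
z = sample_latent(np.zeros(3), np.zeros(3), rng)
penalty = kl_to_standard_normal(np.zeros(3), np.zeros(3))
```

When the posterior equals the prior (zero mean, unit variance) the KL term vanishes; it grows as the encoder output moves away from the prior.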

Adversarially Learned Total Variability Embedding for Speaker Recognition with Random Digit Strings

Sensors ◽

10.3390/s19214709 ◽

2019 ◽

Vol 19 (21) ◽

pp. 4709 ◽

Cited By ~ 2

Author(s):

Woo Hyun Kang ◽

Nam Soo Kim

Keyword(s):

Speaker Recognition ◽

Latent Variable ◽

Nonlinear Process ◽

Gaussian Mixture ◽

Leibler Divergence ◽

Feature Extractor ◽

Total Variability ◽

Feature Based ◽

Model Training ◽

Increasing Demand

In recent years, the increasing demand for voice-based authentication systems has motivated considerable research into methods for verifying users with a short randomized pass-phrase. In this paper, we propose a novel technique for extracting an i-vector-like feature based on an adversarially learned inference (ALI) model, which summarizes the variability within the Gaussian mixture model (GMM) distribution through a nonlinear process. Analogous to the previously proposed variational autoencoder (VAE)-based feature extractor, the proposed ALI-based model is trained to generate the GMM supervector according to the maximum likelihood criterion, given the Baum–Welch statistics of the input utterance. However, to prevent the potential loss of information caused by the Kullback–Leibler divergence (KL divergence) regularization adopted in the VAE-based model training, the newly proposed ALI-based feature extractor uses a joint discriminator to ensure that the generated latent variable and the GMM supervector are more realistic. The proposed framework is compared with the conventional i-vector and VAE-based methods using the TIDIGITS dataset. Experimental results show that the proposed method can represent the uncertainty caused by the short duration better than the VAE-based method. Furthermore, the proposed approach performed well when applied in conjunction with the standard i-vector framework.
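
The Baum–Welch statistics mentioned above are, in the usual GMM setting, the posterior-weighted counts and feature sums per mixture component. The sketch below computes the zeroth- and first-order statistics for a diagonal-covariance GMM. It is a minimal illustration under stated assumptions: the diagonal covariances, the function name `baum_welch_stats`, and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np


def baum_welch_stats(frames, weights, means, variances):
    """Zeroth- and first-order statistics of an utterance under a GMM.

    frames: (T, D) features; weights: (C,); means, variances: (C, D).
    """
    # Log-density of every frame under every diagonal Gaussian, shape (T, C).
    diff = frames[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances[None, :, :], axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
    )
    log_post = np.log(weights)[None, :] + log_gauss
    # Normalize in the log domain to obtain per-frame component posteriors.
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)
    zeroth = post.sum(axis=0)   # (C,) soft frame counts per component
    first = post.T @ frames     # (C, D) posterior-weighted feature sums
    return zeroth, first


# Assumed toy setup: 50 frames, 3 feature dimensions, 2 components.
rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 3))
weights = np.array([0.4, 0.6])
means = rng.standard_normal((2, 3))
variances = np.ones((2, 3))
zeroth, first = baum_welch_stats(frames, weights, means, variances)
```

Because the posteriors of each frame sum to one, the zeroth-order statistics sum to the number of frames, which is a convenient sanity check.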

An analysis of speaker dependent models in replay detection

APSIPA Transactions on Signal and Information Processing ◽

10.1017/atsip.2020.9 ◽

2020 ◽

Vol 9 ◽

Author(s):

Gajan Suthokumar ◽

Kaavya Sriskandaraja ◽

Vidhyasaharan Sethu ◽

Eliathamby Ambikairajah ◽

Haizhou Li

Keyword(s):

Speaker Verification ◽

Detection System ◽

Gaussian Mixture ◽

Equal Error Rate ◽

Speaker Variability ◽

Speaker Independent ◽

Spoofing Detection ◽

Potential Benefits ◽

Target Speaker ◽

Small Improvement

Most research on replay detection has focused on developing a stand-alone countermeasure that runs independently of a speaker verification system by training a single spoofed model and a single genuine model for all speakers. In this paper, we explore the potential benefits of adapting the back-end of a spoofing detection system towards the claimed target speaker. Specifically, we characterize and quantify speaker variability by comparing speaker-dependent and speaker-independent (SI) models of feature distributions for both genuine and spoofed speech. Following this, we develop an approach for implementing speaker-dependent spoofing detection using a Gaussian mixture model (GMM) back-end, where both the genuine and spoofed models are adapted to the claimed speaker. Finally, we also develop and evaluate a speaker-specific neural network-based spoofing detection system in addition to the GMM based back-end. Evaluations of the proposed approaches on replay corpora BTAS2016 and ASVspoof2017 v2.0 reveal that the proposed speaker-dependent spoofing detection outperforms equivalent SI replay detection baselines on both datasets. Our experimental results show that the use of speaker-specific genuine models leads to a significant improvement (around 4% in terms of equal error rate (EER)) as previously shown and the addition of speaker-specific spoofed models adds a small improvement on top (less than 1% in terms of EER).
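
The equal error rate quoted above is the operating point at which the false acceptance rate and the false rejection rate coincide. The sketch below estimates it from two lists of detection scores by sweeping a threshold over the observed scores. It is a minimal illustration, assuming that higher scores indicate genuine speech; the function name `equal_error_rate` and the simple threshold sweep are hypothetical choices, not the evaluation code used in the paper.

```python
import numpy as np


def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate the EER by sweeping a threshold over all observed scores."""
    genuine = np.asarray(genuine_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, spoof]))
    # False rejection: genuine trials scored below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # False acceptance: spoofed trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return 0.5 * (far[idx] + frr[idx])
```

With few trials the two error curves may not cross exactly, which is why the sketch averages the two rates at the closest threshold.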

Sketching for large-scale learning of mixture models

Information and Inference: A Journal of the IMA ◽

10.1093/imaiai/iax015 ◽

2017 ◽

Vol 7 (3) ◽

pp. 447-508 ◽

Cited By ~ 5

Author(s):

Nicolas Keriven ◽

Anthony Bourrier ◽

Rémi Gribonval ◽

Patrick Pérez

Keyword(s):

Compressive Sensing ◽

Large Scale ◽

Speaker Verification ◽

Synthetic Data ◽

Gaussian Mixture ◽

Training Data ◽

Model Parameters ◽

Reconstruction Algorithms ◽

Translation Invariant ◽

Generalized Moments

Learning parameters from voluminous data can be prohibitive in terms of memory and computational requirements. We propose a ‘compressive learning’ framework, where we estimate model parameters from a sketch of the training data. This sketch is a collection of generalized moments of the underlying probability distribution of the data. It can be computed in a single pass on the training set and is easily computable on streams or distributed datasets. The proposed framework shares similarities with compressive sensing, which aims at drastically reducing the dimension of high-dimensional signals while preserving the ability to reconstruct them. To perform the estimation task, we derive an iterative algorithm analogous to sparse reconstruction algorithms in the context of linear inverse problems. We exemplify our framework with the compressive estimation of a Gaussian mixture model (GMM), providing heuristics on the choice of the sketching procedure and theoretical guarantees of reconstruction. We experimentally show on synthetic data that the proposed algorithm yields results comparable to the classical expectation-maximization technique while requiring significantly less memory and fewer computations when the number of database elements is large. We further demonstrate the potential of the approach on real large-scale data (over $10^{8}$ training samples) for the task of model-based speaker verification. Finally, we draw some connections between the proposed framework and approximate Hilbert space embedding of probability distributions using random features. We show that the proposed sketching operator can be seen as an innovative method to design translation-invariant kernels adapted to the analysis of GMMs. We also use this theoretical framework to derive preliminary information preservation guarantees, in the spirit of infinite-dimensional compressive sensing.
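
To make the idea of a single-pass, mergeable sketch concrete, the example below accumulates empirical averages of random Fourier features over chunks of data. This is a minimal sketch under stated assumptions: the choice of complex exponentials as the generalized moments, the Gaussian frequency draw, and the `Sketch` class are hypothetical illustrations, and the step of recovering GMM parameters from the sketch is not shown.

```python
import numpy as np


class Sketch:
    """Running average of exp(i * w^T x) over a fixed set of frequencies w."""

    def __init__(self, frequencies):
        self.frequencies = np.asarray(frequencies, dtype=float)  # (m, d)
        self.total = np.zeros(len(self.frequencies), dtype=complex)
        self.count = 0

    def update(self, chunk):
        """Fold a chunk of samples, shape (n, d), into the running sums."""
        chunk = np.atleast_2d(chunk)
        phases = chunk @ self.frequencies.T  # (n, m)
        self.total += np.exp(1j * phases).sum(axis=0)
        self.count += len(chunk)

    def merge(self, other):
        """Combine with a sketch computed elsewhere on other data."""
        self.total += other.total
        self.count += other.count

    def value(self):
        return self.total / self.count


# Assumed toy data: the same sketch is obtained in one pass or from two parts.
rng = np.random.default_rng(0)
data = rng.standard_normal((100, 2))
freqs = rng.standard_normal((5, 2))
full = Sketch(freqs)
full.update(data)
part_a, part_b = Sketch(freqs), Sketch(freqs)
part_a.update(data[:40])
part_b.update(data[40:])
part_a.merge(part_b)
```

Only the running sums and the sample count are stored, so the memory footprint depends on the number of frequencies rather than on the size of the dataset.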

Angular Softmax for Short-Duration Text-independent Speaker Verification

10.21437/interspeech.2018-1545 ◽

2018 ◽

Cited By ~ 20

Author(s):

Zili Huang ◽

Shuai Wang ◽

Kai Yu

Keyword(s):

Short Duration ◽

Speaker Verification ◽

Text Independent Speaker Verification

Combining amplitude and phase-based features for speaker verification with short duration utterances

10.21437/interspeech.2015-94 ◽

2015 ◽

Author(s):

Md. Jahangir Alam ◽

Patrick Kenny ◽

Themos Stafylakis

Keyword(s):

Short Duration ◽

Speaker Verification

Self-Adaptive Multi-Sensor Activity Recognition Systems Based on Gaussian Mixture Models

Informatics ◽

10.3390/informatics5030038 ◽

2018 ◽

Vol 5 (3) ◽

pp. 38 ◽

Cited By ~ 3

Author(s):

Martin Jänicke ◽

Bernhard Sick ◽

Sven Tomforde

Keyword(s):

Activity Recognition ◽

Gaussian Mixture Models ◽

Gaussian Mixture ◽

Recognition System ◽

Training Data ◽

Sensor Data ◽

Activity Data ◽

High Loss ◽

New Sensors ◽

Self Adaptive

Personal wearables such as smartphones or smartwatches are increasingly used in everyday life. Frequently, activity recognition is performed on these devices to estimate the current user status and trigger automated actions according to the user’s needs. In this article, we focus on the creation of a self-adaptive activity recognition system based on inertial measurement units (IMUs) that integrates new sensors at runtime. Starting with a classifier based on a GMM, the density model is adapted to new sensor data fully autonomously by using the marginalization property of normal distributions. To create a classifier from the adapted model, label inference is performed, based either on the initial classifier or on the training data. For evaluation, we used more than 10 h of annotated activity data from the publicly available PAMAP2 benchmark dataset. Using these data, we showed the feasibility of our approach and performed 9720 experiments to obtain robust results. One approach performed reasonably well, leading to a system improvement on average, with an increase in the F-score of 0.0053, while the other shows clear drawbacks due to a high loss of information during label inference. Furthermore, a comparison with state-of-the-art techniques shows the necessity for further experiments in this area.
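
The marginalization property referred to above states that any subset of the dimensions of a jointly Gaussian vector is again Gaussian, with the corresponding entries of the mean and the corresponding sub-block of the covariance. The sketch below applies this to a single Gaussian component. It is a minimal illustration: the function name `marginalize_gaussian` and the toy numbers are hypothetical, and how the authors use the property to add sensors is not reproduced here.

```python
import numpy as np


def marginalize_gaussian(mean, cov, keep):
    """Marginal of N(mean, cov) over the dimensions listed in `keep`."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    keep = np.asarray(keep)
    # Select the sub-vector of the mean and the matching covariance block.
    return mean[keep], cov[np.ix_(keep, keep)]


# Assumed toy component over three sensor dimensions; keep the first and third.
mean = np.array([1.0, 2.0, 3.0])
cov = np.array([[2.0, 0.3, 0.1],
                [0.3, 1.0, 0.2],
                [0.1, 0.2, 0.5]])
sub_mean, sub_cov = marginalize_gaussian(mean, cov, [0, 2])
```

For a GMM the same selection can be applied to every component while the mixture weights stay unchanged.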

UCSY-SC1: A Myanmar speech corpus for automatic speech recognition

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v9i4.pp3194-3202 ◽

2019 ◽

Vol 9 (4) ◽

pp. 3194 ◽

Cited By ~ 1

Author(s):

Aye Nyein Mon ◽

Win Pa Pa ◽

Ye Kyaw Thu

Keyword(s):

Neural Network ◽

Speech Recognition ◽

Automatic Speech Recognition ◽

Gaussian Mixture ◽

Error Rates ◽

Training Data ◽

Speech Corpus ◽

Total Size ◽

Test Sets ◽

Web News

This paper introduces a speech corpus developed for Myanmar Automatic Speech Recognition (ASR) research. ASR research has been conducted by researchers around the world to improve their language technologies. Speech corpora are important in developing ASR, and their creation is especially necessary for low-resourced languages. The Myanmar language can be regarded as a low-resourced language because of the lack of pre-created resources for speech processing research. In this work, a speech corpus named UCSY-SC1 (University of Computer Studies Yangon - Speech Corpus1) is created for Myanmar ASR research. The corpus covers two domains: news and daily conversations. The total size of the speech corpus is over 42 hrs, comprising 25 hrs of web news and 17 hrs of recorded conversational data. The corpus was collected from 177 females and 84 males for the news data and from 42 females and 4 males for the conversational domain. This corpus was used as training data for developing Myanmar ASR. Three types of acoustic models, Gaussian Mixture Model (GMM) - Hidden Markov Model (HMM), Deep Neural Network (DNN), and Convolutional Neural Network (CNN), were built and their results compared. Experiments were conducted on different data sizes, and evaluation was done on two test sets: TestSet1 (web news) and TestSet2 (recorded conversational data). The Myanmar ASR systems built with this corpus gave satisfactory results on both test sets, with word error rates of 15.61% on TestSet1 and 24.43% on TestSet2.
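
Word error rate, the metric reported above, is conventionally the word-level edit distance between the reference and the recognized transcript divided by the number of reference words. The sketch below computes it with a standard dynamic-programming edit distance. It is a minimal illustration, assuming whitespace-separated words; the function name `word_error_rate` is hypothetical, and the word segmentation actually used for Myanmar text in the paper is not specified here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.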
