An analysis of speaker dependent models in replay detection

Author(s):  
Gajan Suthokumar ◽  
Kaavya Sriskandaraja ◽  
Vidhyasaharan Sethu ◽  
Eliathamby Ambikairajah ◽  
Haizhou Li

Most research on replay detection has focused on developing a stand-alone countermeasure that runs independently of a speaker verification system by training a single spoofed model and a single genuine model for all speakers. In this paper, we explore the potential benefits of adapting the back-end of a spoofing detection system towards the claimed target speaker. Specifically, we characterize and quantify speaker variability by comparing speaker-dependent and speaker-independent (SI) models of feature distributions for both genuine and spoofed speech. Following this, we develop an approach for implementing speaker-dependent spoofing detection using a Gaussian mixture model (GMM) back-end, where both the genuine and spoofed models are adapted to the claimed speaker. Finally, we also develop and evaluate a speaker-specific neural network-based spoofing detection system in addition to the GMM based back-end. Evaluations of the proposed approaches on the replay corpora BTAS2016 and ASVspoof2017 v2.0 reveal that the proposed speaker-dependent spoofing detection outperforms equivalent SI replay detection baselines on both datasets. Our experimental results show that the use of speaker-specific genuine models leads to a significant improvement (around 4% in terms of equal error rate (EER)), as previously shown, and the addition of speaker-specific spoofed models adds a small improvement on top (less than 1% in terms of EER).
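The speaker-dependent GMM back-end described above can be sketched with standard MAP mean adaptation followed by a log-likelihood-ratio score. This is a minimal illustration on synthetic features, not the authors' implementation; the feature dimensionality, component count, and relevance factor are assumptions for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical stand-ins for pooled frame-level features (the paper's actual
# features and corpora are not reproduced here).
genuine_pool = rng.normal(0.0, 1.0, size=(500, 4))
spoofed_pool = rng.normal(0.5, 1.2, size=(500, 4))

# Speaker-independent back-end: one genuine and one spoofed GMM for all speakers.
gmm_gen = GaussianMixture(n_components=4, covariance_type="diag", random_state=0).fit(genuine_pool)
gmm_spf = GaussianMixture(n_components=4, covariance_type="diag", random_state=0).fit(spoofed_pool)

def map_adapt_means(gmm, feats, relevance=16.0):
    """MAP-adapt only the component means toward a claimed speaker's data."""
    resp = gmm.predict_proba(feats)             # (T, C) responsibilities
    n_c = resp.sum(axis=0)                      # soft counts per component
    f_c = resp.T @ feats                        # first-order statistics
    alpha = (n_c / (n_c + relevance))[:, None]  # data-dependent adaptation weights
    new_means = alpha * (f_c / np.maximum(n_c, 1e-8)[:, None]) + (1 - alpha) * gmm.means_
    adapted = GaussianMixture(n_components=gmm.n_components, covariance_type="diag")
    adapted.weights_, adapted.covariances_ = gmm.weights_, gmm.covariances_
    adapted.means_ = new_means
    adapted.precisions_cholesky_ = 1.0 / np.sqrt(gmm.covariances_)
    return adapted

# Enrollment data for the claimed speaker (assumed available), then an LLR score
# of a test utterance against the adapted genuine model and the spoofed model.
enrol = rng.normal(0.2, 1.0, size=(100, 4))
sd_gen = map_adapt_means(gmm_gen, enrol)
test = rng.normal(0.2, 1.0, size=(50, 4))
llr = sd_gen.score(test) - gmm_spf.score(test)  # genuine-vs-spoofed log-likelihood ratio
```

In the paper's full setup the spoofed model is adapted as well; the same `map_adapt_means` call applied to `gmm_spf` would cover that case.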

PLoS ONE ◽  
2020 ◽  
Vol 15 (11) ◽  
pp. e0241809
Author(s):  
Hongwei Mao ◽  
Yan Shi ◽  
Yue Liu ◽  
Linqiang Wei ◽  
Yijie Li ◽  
...  

In recent years, great progress has been made in the technical aspects of automatic speaker verification (ASV). However, the wider adoption of ASV technology is still very challenging, because most systems remain sensitive to new, unknown, and spoofing conditions. Most previous studies focused on extracting target speaker information from natural speech. This paper aims to design a new ASV corpus with multiple speaking styles and investigate ASV robustness to these different speaking styles. We first release this corpus on the Zenodo website for public research; each speaker contributes several text-dependent and text-independent singing, humming, and normal reading speech utterances. Then, we investigate the speaker discrimination of each speaking style in the feature space. Furthermore, the intra- and inter-speaker variabilities within each speaking style and across speaking styles are investigated in both text-dependent and text-independent ASV tasks. A conventional Gaussian mixture model (GMM) and the state-of-the-art x-vector are used to build ASV systems. Experimental results show that the voiceprint information in humming and singing speech is more distinguishable than that in normal reading speech for conventional ASV systems. Furthermore, we find that combining the three speaking styles can significantly improve the x-vector-based ASV system, even though only limited gains are obtained by conventional GMM-based systems.
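A back-end comparison like the one above typically scores x-vector embeddings with cosine similarity, and combining speaking styles can be as simple as averaging per-style enrollment embeddings. The embeddings below are random stand-ins, not features extracted from the released corpus:

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine_score(a, b):
    """Cosine similarity, a standard back-end score for x-vector embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 128-dim embeddings for one speaker in three speaking styles.
styles = {s: rng.normal(size=128) for s in ("reading", "humming", "singing")}

# Combining the styles: average the per-style embeddings into one enrollment vector.
enrolment = np.mean(list(styles.values()), axis=0)

# A test utterance close to the speaker's humming embedding vs. an impostor.
test = styles["humming"] + rng.normal(scale=0.1, size=128)
same = cosine_score(enrolment, test)
diff = cosine_score(enrolment, rng.normal(size=128))
```

With real embeddings the same-speaker score should clearly exceed the impostor score, which is the property the cross-style experiments probe.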


2002 ◽  
Vol 12 (05) ◽  
pp. 381-397 ◽  
Author(s):  
K. K. YIU ◽  
M. W. MAK ◽  
S. Y. KUNG

This paper compares kernel-based probabilistic neural networks for speaker verification based on 138 speakers of the YOHO corpus. Experimental evaluations using probabilistic decision-based neural networks (PDBNNs), Gaussian mixture models (GMMs) and elliptical basis function networks (EBFNs) as speaker models were conducted. The original training algorithm of PDBNNs was also modified to make PDBNNs appropriate for speaker verification. Results show that the equal error rate obtained by PDBNNs and GMMs is less than that of EBFNs (0.33% vs. 0.48%), suggesting that GMM- and PDBNN-based speaker models outperform the EBFN ones. This work also finds that the globally supervised learning of PDBNNs is able to find decision thresholds that not only keep the false acceptance rates at a low level but also reduce their variation, whereas the ad hoc threshold-determination approach used by the EBFNs and GMMs causes a large variation in the error rates. This property makes the performance of PDBNN-based systems more predictable.
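The equal error rate used to compare the three speaker models is the operating point where false acceptance and false rejection rates meet. A minimal sketch of its computation on synthetic genuine- and impostor-trial scores:

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER: the threshold where false-acceptance and false-rejection rates cross."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])  # false acceptances
    frr = np.array([(target_scores < t).mean() for t in thresholds])     # false rejections
    idx = np.argmin(np.abs(far - frr))          # closest crossing point
    return float((far[idx] + frr[idx]) / 2)

# Illustrative, well-separated score distributions (not YOHO results).
rng = np.random.default_rng(0)
tgt = rng.normal(2.0, 1.0, 1000)    # genuine-trial scores
non = rng.normal(-2.0, 1.0, 1000)   # impostor-trial scores
eer = equal_error_rate(tgt, non)
```

The threshold-variation finding in the abstract concerns how stable this crossing point is across speakers, not just its average value.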


Sensors ◽  
2020 ◽  
Vol 20 (23) ◽  
pp. 6784
Author(s):  
Xin Fang ◽  
Tian Gao ◽  
Liang Zou ◽  
Zhenhua Ling

Automatic speaker verification provides a flexible and effective way for biometric authentication. Previous deep learning-based methods have demonstrated promising results, although a few problems still require better solutions. In prior works examining speaker discriminative neural networks, the speaker representation of the target speaker is treated as fixed when compared against utterances from different speakers, and the joint information between enrollment and evaluation utterances is ignored. In this paper, we propose to combine CNN-based feature learning with a bidirectional attention mechanism to achieve better performance with only one enrollment utterance. The evaluation-enrollment joint information is exploited to provide interactive features through bidirectional attention. In addition, we introduce one individual cost function to identify the phonetic contents, which contributes to calculating the attention score more specifically. These interactive features are complementary to the constant ones, which are extracted from individual speakers separately and do not vary with the evaluation utterances. The proposed method achieved a competitive equal error rate of 6.26% on the internal “DAN DAN NI HAO” benchmark dataset with 1250 utterances and outperformed various baseline methods, including the traditional i-vector/PLDA, d-vector, self-attention, and sequence-to-sequence attention models.
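The bidirectional attention that produces the interactive features can be sketched as scaled dot-product attention applied in both directions between enrollment and evaluation frame features. The dimensions and random features below are illustrative assumptions, not the paper's CNN outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_attention(enrol, evalu):
    """Let each side attend over the other; return pooled interactive features."""
    scores = enrol @ evalu.T / np.sqrt(enrol.shape[1])  # (Te, Tv) frame similarities
    enrol_ctx = softmax(scores, axis=1) @ evalu         # enrollment attends to evaluation
    eval_ctx = softmax(scores.T, axis=1) @ enrol        # evaluation attends to enrollment
    return enrol_ctx.mean(axis=0), eval_ctx.mean(axis=0)

# Hypothetical frame-level features from a CNN front-end.
enrol_frames = rng.normal(size=(40, 64))
eval_frames = rng.normal(size=(55, 64))
e_vec, v_vec = bidirectional_attention(enrol_frames, eval_frames)
```

In the proposed system these interactive vectors would be concatenated with the constant per-speaker features before the final verification decision.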


2019 ◽  
Vol 9 (8) ◽  
pp. 1539 ◽  
Author(s):  
Jiakang Li ◽  
Xiongwei Zhang ◽  
Meng Sun ◽  
Xia Zou ◽  
Changyan Zheng

Even though audio replay detection has improved in recent years, its performance is known to deteriorate severely in the presence of strong background noise. Given that different frames of an utterance have different impacts on the performance of spoofing detection, this paper introduces attention-based long short-term memory (LSTM) to extract representative frames for spoofing detection in noisy environments. With this attention mechanism, specific and representative frame-level features are automatically selected by adjusting their weights within the attention-based LSTM framework. Experiments conducted on the ASVspoof 2017 dataset version 2.0 show that the equal error rate (EER) of the proposed approach was about 13% lower than the constant Q cepstral coefficients-Gaussian mixture model (CQCC-GMM) baseline in noisy environments with four different signal-to-noise ratios (SNRs). The proposed algorithm also improved the performance of traditional LSTM-based audio replay detection systems in noisy environments. Experiments using bagging with different frame lengths were also conducted to further improve the proposed approach.
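The frame-weighting idea can be illustrated with a simple attention-pooling step over frame-level features. The features here are random stand-ins for LSTM hidden states, and the attention vector would be learned jointly with the network in the real system:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_pool(hidden, w):
    """Weight frame-level features by attention scores before pooling."""
    logits = hidden @ w                   # one scalar relevance score per frame
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()                  # softmax weights over frames
    return alpha, (alpha[:, None] * hidden).sum(axis=0)

frames = rng.normal(size=(120, 32))  # stand-in for LSTM hidden states of one utterance
w = rng.normal(size=32)              # stand-in for a trained attention vector
alpha, utt_vec = attention_pool(frames, w)
```

Frames dominated by noise would receive small weights `alpha`, so the pooled utterance vector `utt_vec` is driven by the representative frames, which is the mechanism the abstract credits for the noise robustness.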


2019 ◽  
Vol 9 (8) ◽  
pp. 1597 ◽  
Author(s):  
Woo Hyun Kang ◽  
Nam Soo Kim

Recently, the increasing demand for voice-based authentication systems has encouraged researchers to investigate methods for verifying users with short randomized pass-phrases with constrained vocabulary. The conventional i-vector framework, which has been proven to be a state-of-the-art utterance-level feature extraction technique for speaker verification, is not considered to be an optimal method for this task since it is known to suffer from severe performance degradation when dealing with short-duration speech utterances. More recent approaches that implement deep-learning techniques for embedding the speaker variability in a non-linear fashion have shown impressive performance in various speaker verification tasks. However, since most of these techniques are trained in a supervised manner, which requires speaker labels for the training data, it is difficult to use them when only a scarce amount of labeled data is available for training. In this paper, we propose a novel technique for extracting an i-vector-like feature based on the variational autoencoder (VAE), which is trained in an unsupervised manner to obtain a latent variable representing the variability within a Gaussian mixture model (GMM) distribution. The proposed framework is compared with the conventional i-vector method using the TIDIGITS dataset. Experimental results showed that the proposed method could cope with the performance deterioration caused by short durations. Furthermore, the performance of the proposed approach improved significantly when applied in conjunction with the conventional i-vector framework.
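The unsupervised latent-variable extraction rests on the VAE's reparameterization trick. A linear-encoder sketch (all weights, dimensions, and the use of GMM supervector statistics as input are hypothetical simplifications) shows how an i-vector-like feature z is drawn:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w_mu, w_logvar):
    """Linear stand-in for the VAE encoder: map utterance stats to mean/log-variance."""
    return x @ w_mu, x @ w_logvar

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps, keeping the latent differentiable during training."""
    return mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)

# Hypothetical GMM supervector statistics as encoder input (dims are illustrative).
stats = rng.normal(size=(1, 256))
w_mu = rng.normal(size=(256, 32))
w_logvar = rng.normal(size=(256, 32)) * 0.01
mu, logvar = encode(stats, w_mu, w_logvar)
z = reparameterize(mu, logvar, rng)  # i-vector-like utterance-level feature
```

At extraction time one would typically use the posterior mean `mu` directly as the utterance feature; the sampling step matters during training, where the KL term regularizes the latent space.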


2020 ◽  
Vol 10 (18) ◽  
pp. 6571 ◽  
Author(s):  
Sung-Hyun Yoon ◽  
Jong-June Jeon ◽  
Ha-Jin Yu

In the field of speaker verification, probabilistic linear discriminant analysis (PLDA) is the dominant method for back-end scoring. To estimate the PLDA model, the between-class covariance and within-class precision matrices must be estimated from samples. However, the empirical covariance/precision estimated from samples has estimation errors due to the limited number of samples available. In this paper, we propose a method to improve the conventional PLDA by estimating the PLDA model using the regularized within-class precision matrix. We use the graphical least absolute shrinkage and selection operator (GLASSO) for regularization. The GLASSO regularization decreases the estimation errors in the empirical precision matrix by making the precision matrix sparse, which corresponds to reflecting the conditional independence structure. The experimental results on text-dependent speaker verification reveal that the proposed method reduces the equal error rate by up to 23% relative, compared with the conventional PLDA.
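The regularization step can be reproduced with scikit-learn's `GraphicalLasso`, which sparsifies a precision matrix exactly as described. The within-class residual data and the penalty strength below are illustrative, not the paper's setup:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Hypothetical within-class residuals (embeddings minus their class means).
residuals = rng.normal(size=(200, 10))

# Empirical precision: dense and noisy when the sample count is limited.
emp_precision = np.linalg.inv(np.cov(residuals, rowvar=False))

# GLASSO shrinks small partial correlations to exactly zero, yielding a sparse
# precision matrix that encodes a conditional independence structure.
glasso = GraphicalLasso(alpha=0.2).fit(residuals)
sparse_precision = glasso.precision_
```

The regularized `sparse_precision` would then replace the empirical within-class precision inside the PLDA estimation; the `alpha` penalty trades off sparsity against fidelity to the data.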

