Emotional Variability Analysis Based I-Vector for Speaker Verification in Under-Stress Conditions

Barlian Henryranu Prasetio; Hiroki Tamura; Koichi Tanno

doi:10.3390/electronics9091420

Electronics ◽

10.3390/electronics9091420 ◽

2020 ◽

Vol 9 (9) ◽

pp. 1420

Author(s):

Barlian Henryranu Prasetio ◽

Hiroki Tamura ◽

Koichi Tanno

Keyword(s):

State Of The Art ◽

Speaker Verification ◽

Vector System ◽

Verification Task ◽

Scoring Methods ◽

Equal Error Rate ◽

Variability Analysis ◽

Ablation Study ◽

Speech Segments ◽

Neutral Conditions

Emotional conditions cause changes in the speech production system, which in turn produce differences in acoustic characteristics compared with neutral conditions. The presence of emotion degrades the performance of a speaker verification system. In this paper, we propose a speaker modeling approach that accommodates the presence of emotion in speech segments by extracting a compact speaker representation. The speaker model is estimated by following a procedure similar to the i-vector technique, but it treats the emotional effect as the channel variability component. We name this method emotional variability analysis (EVA). EVA represents the emotion subspace separately from the speaker subspace, as in the joint factor analysis (JFA) model. The effectiveness of the proposed system is evaluated by comparing it with the standard i-vector system on the speaker verification task of the Speech Under Simulated and Actual Stress (SUSAS) dataset with three different scoring methods. The evaluation focuses on the equal error rate (EER). In addition, we conducted an ablation study for a more comprehensive analysis of the EVA-based i-vector. Based on the experimental results, the proposed system outperformed the standard i-vector system and achieved state-of-the-art results in the verification task for speakers under stress.
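
The equal error rate used as the evaluation metric here is the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of how it can be estimated from trial scores follows; the function name and the toy scores are illustrative assumptions and are not taken from the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    best = None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection rate: genuine-speaker trials that are rejected.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance rate: impostor trials that are accepted.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, threshold)
    return best[1], best[2]


# Hypothetical scores, for illustration only.
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.1, 0.2, 0.4, 0.6])
```

With a finite number of trials the two error rates rarely coincide exactly, so the sketch reports the average of the two at the threshold where they are closest.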

Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Deep Length Normalization for Text-Independent Speaker Verification System

Electronics ◽

10.3390/electronics9101706 ◽

2020 ◽

Vol 9 (10) ◽

pp. 1706

Author(s):

Soonshin Seo ◽

Ji-Hwan Kim

Keyword(s):

State Of The Art ◽

Speaker Verification ◽

Model Parameters ◽

Equal Error Rate ◽

Layer Depth ◽

Verification System ◽

Evaluation Dataset ◽

Representational Power ◽

Fully Connected ◽

Text Independent Speaker Verification

One of the most important parts of a text-independent speaker verification system is speaker embedding generation. Previous studies demonstrated that shortcut connections-based multi-layer aggregation improves the representational power of a speaker embedding system. However, model parameters are relatively large in number, and unspecified variations increase in the multi-layer aggregation. Therefore, in this study, we propose a self-attentive multi-layer aggregation with feature recalibration and deep length normalization for a text-independent speaker verification system. To reduce the number of model parameters, we set the ResNet with the scaled channel width and layer depth as a baseline. To control the variability in the training, we apply a self-attention mechanism to perform multi-layer aggregation with dropout regularizations and batch normalizations. Subsequently, we apply a feature recalibration layer to the aggregated feature using fully-connected layers and nonlinear activation functions. Further, deep length normalization is used on a recalibrated feature in the training process. Experimental results using the VoxCeleb1 evaluation dataset showed that the performance of the proposed methods was comparable to that of state-of-the-art models (equal error rate of 4.95% and 2.86%, using the VoxCeleb1 and VoxCeleb2 training datasets, respectively).
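
In its basic form, length normalization rescales an embedding to a fixed Euclidean norm so that verification scores depend on direction rather than magnitude. The sketch below shows only that basic operation followed by cosine scoring. It is not the paper's deep length normalization layer, whose details are not given in the abstract, and the `scale` parameter and toy vectors are assumptions.

```python
import math


def length_normalize(embedding, scale=1.0):
    """Rescale an embedding so that its Euclidean norm equals `scale`."""
    norm = math.sqrt(sum(v * v for v in embedding))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [scale * v / norm for v in embedding]


def cosine_score(enrol, test):
    """Cosine similarity of two embeddings, computed on unit-length copies."""
    a = length_normalize(enrol)
    b = length_normalize(test)
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 3-dimensional embeddings, for illustration only.
enrol = [3.0, 4.0, 0.0]
test = [6.0, 8.0, 0.0]
score = cosine_score(enrol, test)
```

The two toy vectors differ only in magnitude, so after normalization they are scored as identical.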

Introducing phonetic information to speaker embedding for speaker verification

EURASIP Journal on Audio Speech and Music Processing ◽

10.1186/s13636-019-0166-8 ◽

2019 ◽

Vol 2019 (1) ◽

Cited By ~ 1

Author(s):

Yi Liu ◽

Liang He ◽

Jia Liu ◽

Michael T. Johnson

Keyword(s):

Speech Processing ◽

Speaker Recognition ◽

Speaker Verification ◽

Vector System ◽

Equal Error Rate ◽

Phonetic Information ◽

Essential Components ◽

Verification Systems ◽

Vector Architectures ◽

Vector Approach

Phonetic information is one of the most essential components of a speech signal, playing an important role in many speech processing tasks. However, it is difficult to integrate phonetic information into speaker verification systems since it occurs primarily at the frame level while speaker characteristics typically reside at the segment level. In deep neural network-based speaker verification, existing methods only apply phonetic information to the frame-wise trained speaker embeddings. To address this weakness, this paper proposes phonetic adaptation and hybrid multi-task learning and further combines these into c-vector and simplified c-vector architectures. Experiments on National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2010 show that the four proposed speaker embeddings achieve better performance than the baseline. The c-vector system performs the best, providing over 30% and 15% relative improvements in equal error rate (EER) for the core-extended and 10 s–10 s conditions, respectively. On the NIST SRE 2016, 2018, and VoxCeleb datasets, the proposed c-vector approach improves the performance even when there is a language mismatch within the training sets or between the training and evaluation sets. Extensive experimental results demonstrate the effectiveness and robustness of the proposed methods.

BEHRT-HF: an interpretable transformer-based, deep learning model for prediction of incident heart failure

European Heart Journal ◽

10.1093/ehjci/ehaa946.3553 ◽

2020 ◽

Vol 41 (Supplement_2) ◽

Author(s):

S Rao ◽

Y Li ◽

R Ramakrishnan ◽

A Hassaine ◽

D Canoy ◽

...

Keyword(s):

Heart Failure ◽

Deep Learning ◽

State Of The Art ◽

Failure Prediction ◽

Predictive Performance ◽

Learning Model ◽

Learning Framework ◽

Incident Heart Failure ◽

Ablation Study ◽

Deep Learning Model

Background/Introduction: Predicting incident heart failure has been challenging. Deep learning models when applied to rich electronic health records (EHR) offer some theoretical advantages. However, empirical evidence for their superior performance is limited and they remain commonly uninterpretable, hampering their wider use in medical practice.

Purpose: We developed a deep learning framework for more accurate and yet interpretable prediction of incident heart failure.

Methods: We used longitudinally linked EHR from practices across England, involving 100,071 patients, 13% of whom had been diagnosed with incident heart failure during follow-up. We investigated the predictive performance of a novel transformer deep learning model, “Transformer for Heart Failure” (BEHRT-HF), and validated it using both an external held-out dataset and an internal five-fold cross-validation mechanism using area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC). Predictor groups included all outpatient and inpatient diagnoses within their temporal context, medications, age, and calendar year for each encounter. By treating diagnoses as anchors, we alternately removed different modalities (ablation study) to understand the importance of individual modalities to the performance of incident heart failure prediction. Using perturbation-based techniques, we investigated the importance of associations between selected predictors and heart failure to improve model interpretability.

Results: BEHRT-HF achieved high accuracy with AUROC 0.932 and AUPRC 0.695 for external validation, and AUROC 0.933 (95% CI: 0.928, 0.938) and AUPRC 0.700 (95% CI: 0.682, 0.718) for internal validation. BEHRT-HF outperformed the state-of-the-art recurrent deep learning model, RETAIN-EX, by 0.079 and 0.030 in terms of AUPRC and AUROC, respectively. The ablation study showed that medications were strong predictors, and calendar year was more important than age. Utilising perturbation, we identified and ranked the intensity of associations between diagnoses and heart failure. For instance, the method showed that established risk factors including myocardial infarction, atrial fibrillation and flutter, and hypertension were all strongly associated with the heart failure prediction. Additionally, when the population was stratified into different age groups, incident occurrence of a given disease generally had a higher contribution to heart failure prediction in younger ages than when diagnosed later in life.

Conclusions: Our state-of-the-art deep learning framework outperforms the predictive performance of existing models whilst enabling a data-driven way of exploring the relative contribution of a range of risk factors in the context of other temporal information.

Funding Acknowledgement: Type of funding source: Private grant(s) and/or Sponsorship. Main funding source(s): National Institute for Health Research, Oxford Martin School, Oxford Biomedical Research Centre
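
The AUROC reported in this abstract can be read as the probability that a randomly chosen patient who developed heart failure receives a higher risk score than a randomly chosen patient who did not. The sketch below computes it directly from that definition; the function name, labels, and scores are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than efficiency.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U).

    labels: 1 for a positive case, 0 for a negative case.
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and risk scores, for illustration only.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
area = auroc(labels, scores)
```

In the toy data, five of the six positive-negative pairs are ranked correctly, so the area is 5/6.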

Speaker verification system based on articulatory information from ultrasound recordings

DYNA ◽

10.15446/dyna.v87n213.81772 ◽

2020 ◽

Vol 87 (213) ◽

pp. 9-16

Author(s):

Franklin Alexander Sepulveda Sepulveda ◽

Dagoberto Porras-Plata ◽

Milton Sarria-Paja

Keyword(s):

State Of The Art ◽

Speaker Verification ◽

Environmental Noise ◽

Speech Signals ◽

Acoustic Information ◽

Current State ◽

Verification System ◽

Vocal Effort ◽

Ultrasound System ◽

Verification Systems

Current state-of-the-art speaker verification (SV) systems are known to be strongly affected by unexpected variability presented during testing, such as environmental noise or changes in vocal effort. In this work, we analyze and evaluate articulatory information of the tongue's movement as a means to improve the performance of speaker verification systems. We use a Spanish database, where besides the speech signals, we also include articulatory information that was acquired with an ultrasound system. Two groups of features are proposed to represent the articulatory information, and the obtained performance is compared to an SV system trained only with acoustic information. Our results show that the proposed features contain highly discriminative information, and they are related to speaker identity; furthermore, these features can be used to complement and improve existing systems by combining such information with cepstral coefficients at the feature level.

Neural Embedding-Based Metrics for Pre-retrieval Query Performance Prediction

10.32920/ryerson.14654253.v1 ◽

2021 ◽

Author(s):

Arabzadehghahyazi Negar

Keyword(s):

Performance Prediction ◽

State Of The Art ◽

Learning To Rank ◽

The State ◽

Test Collection ◽

Query Performance ◽

Performance Predictors ◽

Level Statistics ◽

Ablation Study ◽

Individual Specificity

Pre-retrieval Query Performance Prediction (QPP) methods are oblivious to the performance of the retrieval model as they predict query difficulty prior to observing the set of documents retrieved for the query. Among pre-retrieval query performance predictors, specificity-based metrics investigate how corpus, query and corpus-query level statistics can be used to predict the performance of the query. In this thesis, we explore how neural embeddings can be utilized to define corpus-independent and semantics-aware specificity metrics. Our metrics are based on the intuition that a term that is closely surrounded by other terms in the embedding space is more likely to be specific while a term surrounded by less closely related terms is more likely to be generic. On this basis, we leverage geometric properties between embedded terms to define four groups of metrics: (1) neighborhood-based, (2) graph-based, (3) cluster-based and (4) vector-based metrics. Moreover, we employ learning-to-rank techniques to analyze the importance of individual specificity metrics. To evaluate the proposed metrics, we have curated and publicly shared a test collection of term specificity measurements defined based on the Wikipedia category hierarchy and the DMOZ taxonomy. We report on our extensive experiments on the effectiveness of our metrics through metric comparison, an ablation study and comparison against state-of-the-art baselines. We have shown that our proposed set of pre-retrieval QPP metrics, based on the properties of pre-trained neural embeddings, is more effective for performance prediction than the state-of-the-art methods. We report our findings based on the Robust04, ClueWeb09 and Gov2 corpora and their associated TREC topics.
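
The intuition behind the neighborhood-based group can be illustrated with a small sketch: score a term by the mean cosine similarity to its k most similar terms, so that a tightly surrounded term scores higher. This is only one possible reading of that intuition, not the thesis's actual metric definitions; the function names, the choice of k, and the toy two-dimensional embeddings are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def neighborhood_specificity(term, embeddings, k=2):
    """Mean cosine similarity between `term` and its k most similar terms."""
    sims = sorted(
        (cosine(embeddings[term], vec) for other, vec in embeddings.items() if other != term),
        reverse=True,
    )
    top = sims[:k]
    if not top:
        raise ValueError("need at least one other term in the vocabulary")
    return sum(top) / len(top)


# Hypothetical 2-D embeddings, for illustration only.
embeddings = {
    "dense_term": [1.0, 0.0],
    "dense_neighbor_1": [0.99, 0.1],
    "dense_neighbor_2": [0.98, -0.1],
    "sparse_term": [0.0, 1.0],
    "sparse_neighbor_1": [-0.7, 0.7],
    "sparse_neighbor_2": [0.7, 0.7],
}
dense_score = neighborhood_specificity("dense_term", embeddings)
sparse_score = neighborhood_specificity("sparse_term", embeddings)
```

Under this reading, `dense_term` would be treated as more specific than `sparse_term` because its nearest neighbors lie much closer to it.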

An investigation into direct scoring methods without SVM training in speaker verification

10.21437/interspeech.2010-143 ◽

2010 ◽

Author(s):

Ce Zhang ◽

Rong Zheng ◽

Bo Xu

Keyword(s):

Speaker Verification ◽

Scoring Methods

Regularized Within-Class Precision Matrix Based PLDA in Text-Dependent Speaker Verification

Applied Sciences ◽

10.3390/app10186571 ◽

2020 ◽

Vol 10 (18) ◽

pp. 6571 ◽

Cited By ~ 1

Author(s):

Sung-Hyun Yoon ◽

Jong-June Jeon ◽

Ha-Jin Yu

Keyword(s):

Conditional Independence ◽

Speaker Verification ◽

Equal Error Rate ◽

Estimation Errors ◽

Precision Matrix ◽

Linear Discriminant ◽

Selection Operator ◽

Independence Structure ◽

Empirical Covariance ◽

Text Dependent Speaker Verification

In the field of speaker verification, probabilistic linear discriminant analysis (PLDA) is the dominant method for back-end scoring. To estimate the PLDA model, the between-class covariance and within-class precision matrices must be estimated from samples. However, the empirical covariance/precision estimated from samples has estimation errors due to the limited number of samples available. In this paper, we propose a method to improve the conventional PLDA by estimating the PLDA model using the regularized within-class precision matrix. We use the graphical least absolute shrinkage and selection operator (GLASSO) for the regularization. The GLASSO regularization decreases the estimation errors in the empirical precision matrix by making the precision matrix sparse, which reflects the conditional independence structure. The experimental results on text-dependent speaker verification reveal that the proposed method reduces the equal error rate by up to 23% relative to the conventional PLDA.
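
For reference, the standard graphical lasso estimates a sparse precision matrix by maximizing an L1-penalized Gaussian log-likelihood. The notation below is introduced here and is not taken from the paper: S denotes an empirical covariance matrix (assumed to be the within-class covariance in this setting), Θ the precision matrix, and λ ≥ 0 the regularization strength. Whether the diagonal is penalized varies between formulations.

```latex
\[
\hat{\Theta} \;=\; \operatorname*{arg\,max}_{\Theta \succ 0}\;
\log\det\Theta \;-\; \operatorname{tr}(S\,\Theta) \;-\; \lambda \lVert \Theta \rVert_{1}
\]
```

A zero entry in the estimated Θ corresponds to conditional independence between the two associated dimensions under the Gaussian model, which is the structure the abstract refers to.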

State-of-the-art sequence kernels for SVM speaker verification

2008 IEEE Workshop on Machine Learning for Signal Processing ◽

10.1109/mlsp.2008.4685530 ◽

2008 ◽

Cited By ~ 1

Author(s):

Jerome Louradour ◽

Khalid Daoudi

Keyword(s):

State Of The Art ◽

Speaker Verification

Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33018658 ◽

2019 ◽

Vol 33 ◽

pp. 8658-8665 ◽

Cited By ~ 10

Author(s):

Xiangpeng Li ◽

Jingkuan Song ◽

Lianli Gao ◽

Xianglong Liu ◽

Wenbing Huang ◽

...

Keyword(s):

Question Answering ◽

State Of The Art ◽

Computation Time ◽

Comparable Result ◽

Video Encoding ◽

Visual Question Answering ◽

Proposed Model ◽

Ablation Study ◽

The Given ◽

Video Question Answering

Most of the recent progress on visual question answering is based on recurrent neural networks (RNNs) with attention. Despite their success, these models are often time-consuming and have difficulty modeling long-range dependencies due to the sequential nature of RNNs. We propose a new architecture, Positional Self-Attention with Co-attention (PSAC), which does not require RNNs for video question answering. Specifically, inspired by the success of self-attention in machine translation, we propose a Positional Self-Attention to calculate the response at each position by attending to all positions within the same sequence, and then add representations of absolute positions. Therefore, PSAC can exploit the global dependencies of question and temporal information in the video, and allows question and video encoding to be executed in parallel. Furthermore, in addition to attending to the video features relevant to the given questions (i.e., video attention), we utilize the co-attention mechanism by simultaneously modeling “what words to listen to” (question attention). To the best of our knowledge, this is the first work to replace RNNs with self-attention for the task of visual question answering. Experimental results of four tasks on the benchmark dataset show that our model significantly outperforms the state-of-the-art on three tasks and attains a comparable result on the Count task. Our model requires less computation time and achieves better performance compared with the RNN-based methods. An additional ablation study demonstrates the effect of each component of our proposed model.
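
The two steps the abstract describes, attending from each position to all positions of the same sequence and then adding a representation of the absolute position, can be sketched in a few lines. This is a simplified illustration rather than the PSAC model: it uses no learned projections (queries, keys, and values are the inputs themselves), and the fixed sinusoidal position encoding is an assumed choice that the abstract does not specify.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Scaled dot-product self-attention with queries = keys = values."""
    dim = len(sequence[0])
    outputs = []
    for query in sequence:
        # Score this position against every position in the same sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in sequence]
        weights = softmax(scores)
        # Each output is a weighted sum of all positions' vectors.
        outputs.append([sum(w * vec[d] for w, vec in zip(weights, sequence)) for d in range(dim)])
    return outputs


def sinusoidal_position(pos, dim):
    """Fixed sinusoidal encoding of an absolute position (an assumed choice)."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def positional_self_attention(sequence):
    """Self-attention followed by adding absolute position encodings."""
    dim = len(sequence[0])
    attended = self_attention(sequence)
    return [
        [a + p for a, p in zip(vec, sinusoidal_position(pos, dim))]
        for pos, vec in enumerate(attended)
    ]


# Hypothetical 3-step sequence of 4-dimensional features.
sequence = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
attended = self_attention(sequence)
output = positional_self_attention(sequence)
```

Because every output position is computed independently of the others, the loop over queries could run in parallel, which is the property the abstract contrasts with the sequential nature of RNNs.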

Mean-Delta Features for Telephone Speech Endpoint Detection

Information Technologies and Control ◽

10.1515/itc-2016-0005 ◽

2014 ◽

Vol 12 (3-4) ◽

pp. 36-44

Author(s):

A. Ouzounov

Keyword(s):

Dynamic Time Warping ◽

Group Delay ◽

Speaker Verification ◽

Verification Task ◽

Endpoint Detection ◽

Time Warping ◽

The Mean ◽

Telephone Speech ◽

Dynamic Time ◽

Speech Endpoint Detection

In this paper, a brief summary of the author’s research in the field of contour-based telephone speech Endpoint Detection (ED) is presented. This research includes the development of two new robust features for ED, the Mean-Delta feature and the Group Delay Mean-Delta feature, and an estimation of the effect of the analyzed ED features and two additional features in the Dynamic Time Warping fixed-text speaker verification task with short noisy telephone phrases in the Bulgarian language.
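
Dynamic Time Warping, used in the verification task mentioned here, aligns two sequences of different lengths by finding the lowest-cost monotonic path through their pairwise distance grid. The sketch below is the textbook recurrence, not the author's system; scalar features and an absolute-difference local cost are simplifying assumptions, since real systems typically compare multi-dimensional feature frames.

```python
def dtw_distance(reference, test):
    """Classic dynamic time warping cost between two 1-D feature sequences."""
    inf = float("inf")
    rows, cols = len(reference), len(test)
    # cost[i][j] is the best alignment cost of the first i and first j frames.
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = abs(reference[i - 1] - test[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[rows][cols]


# Hypothetical feature contours, for illustration only.
reference = [1.0, 2.0, 3.0, 2.0, 1.0]
stretched = [1.0, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
different = [3.0, 3.0, 3.0, 3.0, 3.0]
```

A time-stretched copy of the reference aligns at zero cost, whereas a contour with a different shape does not, which is why the warping cost can serve as a match score for fixed-text phrases.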
