Adversarially Learned Total Variability Embedding for Speaker Recognition with Random Digit Strings

Woo Hyun Kang; Nam Soo Kim

doi:10.3390/s19214709

Adversarially Learned Total Variability Embedding for Speaker Recognition with Random Digit Strings

Sensors ◽

10.3390/s19214709 ◽

2019 ◽

Vol 19 (21) ◽

pp. 4709 ◽

Cited By ~ 2

Author(s):

Woo Hyun Kang ◽

Nam Soo Kim

Keyword(s):

Speaker Recognition ◽

Latent Variable ◽

Nonlinear Process ◽

Gaussian Mixture ◽

Leibler Divergence ◽

Feature Extractor ◽

Total Variability ◽

Feature Based ◽

Model Training ◽

Increasing Demand

Over the recent years, various research has been conducted to investigate methods for verifying users with a short randomized pass-phrase due to the increasing demand for voice-based authentication systems. In this paper, we propose a novel technique for extracting an i-vector-like feature based on an adversarially learned inference (ALI) model which summarizes the variability within the Gaussian mixture model (GMM) distribution through a nonlinear process. Analogous to the previously proposed variational autoencoder (VAE)-based feature extractor, the proposed ALI-based model is trained to generate the GMM supervector according to the maximum likelihood criterion given the Baum–Welch statistics of the input utterance. However, to prevent the potential loss of information caused by the Kullback–Leibler divergence (KL divergence) regularization adopted in the VAE-based model training, the newly proposed ALI-based feature extractor exploits a joint discriminator to ensure that the generated latent variable and the GMM supervector are more realistic. The proposed framework is compared with the conventional i-vector and VAE-based methods using the TIDIGITS dataset. Experimental results show that the proposed method can represent the uncertainty caused by the short duration better than the VAE-based method. Furthermore, the proposed approach has shown great performance when applied in association with the standard i-vector framework.

Download Full-text

Unsupervised Learning of Total Variability Embedding for Speaker Verification with Random Digit Strings

Applied Sciences ◽

10.3390/app9081597 ◽

2019 ◽

Vol 9 (8) ◽

pp. 1597 ◽

Cited By ~ 3

Author(s):

Woo Hyun Kang ◽

Nam Soo Kim

Keyword(s):

Short Duration ◽

Latent Variable ◽

Speaker Verification ◽

Gaussian Mixture ◽

Training Data ◽

Optimal Method ◽

Vector Method ◽

Speaker Variability ◽

Total Variability ◽

Increasing Demand

Recently, the increasing demand for voice-based authentication systems has encouraged researchers to investigate methods for verifying users with short randomized pass-phrases with constrained vocabulary. The conventional i-vector framework, which has been proven to be a state-of-the-art utterance-level feature extraction technique for speaker verification, is not considered to be an optimal method for this task since it is known to suffer from severe performance degradation when dealing with short-duration speech utterances. More recent approaches that implement deep-learning techniques for embedding the speaker variability in a non-linear fashion have shown impressive performance in various speaker verification tasks. However, since most of these techniques are trained in a supervised manner, which requires speaker labels for the training data, it is difficult to use them when a scarce amount of labeled data is available for training. In this paper, we propose a novel technique for extracting an i-vector-like feature based on the variational autoencoder (VAE), which is trained in an unsupervised manner to obtain a latent variable representing the variability within a Gaussian mixture model (GMM) distribution. The proposed framework is compared with the conventional i-vector method using the TIDIGITS dataset. Experimental results showed that the proposed method could cope with the performance deterioration caused by the short duration. Furthermore, the performance of the proposed approach improved significantly when applied in conjunction with the conventional i-vector framework.

Download Full-text

Speaker Recognition Based on Fusion of a Deep and Shallow Recombination Gaussian Supervector

Electronics ◽

10.3390/electronics10010020 ◽

2020 ◽

Vol 10 (1) ◽

pp. 20

Author(s):

Linhui Sun ◽

Yunyi Bu ◽

Bo Zou ◽

Sheng Fu ◽

Pingan Li

Keyword(s):

Speaker Recognition ◽

Feature Fusion ◽

Recognition Rate ◽

Gaussian Mixture ◽

Recognition Method ◽

Different Types ◽

Feature Based ◽

Mel Frequency Cepstral Coefficient ◽

Fusion Feature ◽

Weight Coefficients

Extracting speaker’s personalized feature parameters is vital for speaker recognition. Only one kind of feature cannot fully reflect the speaker’s personality information. In order to represent the speaker’s identity more comprehensively and improve speaker recognition rate, we propose a speaker recognition method based on the fusion feature of a deep and shallow recombination Gaussian supervector. In this method, the deep bottleneck features are first extracted by Deep Neural Network (DNN), which are used for the input of the Gaussian Mixture Model (GMM) to obtain the deep Gaussian supervector. On the other hand, we input the Mel-Frequency Cepstral Coefficient (MFCC) to GMM directly to extract the traditional Gaussian supervector. Finally, the two categories of features are combined in the form of horizontal dimension augmentation. In addition, when the number of speakers to be recognized increases, in order to prevent the system recognition rate from falling sharply, we introduce the optimization algorithm to find the optimal weight before the feature fusion. The experiment results indicate that the speaker recognition rate based on the feature which is fused directly can reach 98.75%, which is 5% and 0.62% higher than the traditional feature and deep bottleneck feature, respectively. When the number of speakers increases, the fusion feature based on optimized weight coefficients can improve the recognition rate by 0.81%. It is validated that our proposed fusion method can effectively consider the complementarity of the different types of features and improve the speaker recognition rate.

Download Full-text

Modified group delay feature based total variability space modelling for speaker recognition

International Journal of Speech Technology ◽

10.1007/s10772-014-9243-7 ◽

2014 ◽

Vol 18 (1) ◽

pp. 17-23 ◽

Cited By ~ 6

Author(s):

Srikanth R. Madikeri ◽

Asha Talambedu ◽

Hema A. Murthy

Keyword(s):

Speaker Recognition ◽

Group Delay ◽

Total Variability ◽

Feature Based ◽

Modified Group Delay

Download Full-text

Gaussian Mixture Model Training Method Based on Particle Swarm Optimizer for Speaker Recognition

Software Engineering and Applications ◽

10.12677/sea.2013.21001 ◽

2013 ◽

Vol 02 (01) ◽

pp. 1-5

Author(s):

丽萍薛

Keyword(s):

Gaussian Mixture Model ◽

Mixture Model ◽

Speaker Recognition ◽

Particle Swarm ◽

Gaussian Mixture ◽

Training Method ◽

Particle Swarm Optimizer ◽

Model Training

Download Full-text

Speaker Identity Recognition by Acoustic and Visual Data Fusion through Personal Privacy for Smart Care and Service Applications

Journal of Imaging Science and Technology ◽

10.2352/j.imagingsci.technol.2020.64.4.040404 ◽

2020 ◽

Vol 64 (4) ◽

pp. 40404-1-40404-16

Author(s):

I.-J. Ding ◽

C.-M. Ruan

Keyword(s):

Face Detection ◽

Speaker Recognition ◽

Visual Information ◽

Classification Tree ◽

Gaussian Mixture ◽

Recognition Method ◽

Indoor Space ◽

Identity Recognition ◽

Visual Identity ◽

Speaker Classification

Abstract With rapid developments in techniques related to the internet of things, smart service applications such as voice-command-based speech recognition and smart care applications such as context-aware-based emotion recognition will gain much attention and potentially be a requirement in smart home or office environments. In such intelligence applications, identity recognition of the specific member in indoor spaces will be a crucial issue. In this study, a combined audio-visual identity recognition approach was developed. In this approach, visual information obtained from face detection was incorporated into acoustic Gaussian likelihood calculations for constructing speaker classification trees to significantly enhance the Gaussian mixture model (GMM)-based speaker recognition method. This study considered the privacy of the monitored person and reduced the degree of surveillance. Moreover, the popular Kinect sensor device containing a microphone array was adopted to obtain acoustic voice data from the person. The proposed audio-visual identity recognition approach deploys only two cameras in a specific indoor space for conveniently performing face detection and quickly determining the total number of people in the specific space. Such information pertaining to the number of people in the indoor space obtained using face detection was utilized to effectively regulate the accurate GMM speaker classification tree design. Two face-detection-regulated speaker classification tree schemes are presented for the GMM speaker recognition method in this study—the binary speaker classification tree (GMM-BT) and the non-binary speaker classification tree (GMM-NBT). The proposed GMM-BT and GMM-NBT methods achieve excellent identity recognition rates of 84.28% and 83%, respectively; both values are higher than the rate of the conventional GMM approach (80.5%). Moreover, as the extremely complex calculations of face recognition in general audio-visual speaker recognition tasks are not required, the proposed approach is rapid and efficient with only a slight increment of 0.051 s in the average recognition time.

Download Full-text

Speaker recognition based on dynamic time warping and Gaussian mixture model

2020 39th Chinese Control Conference (CCC) ◽

10.23919/ccc50068.2020.9188632 ◽

2020 ◽

Author(s):

Nannan Zhang ◽

Yanru Yao

Keyword(s):

Gaussian Mixture Model ◽

Mixture Model ◽

Speaker Recognition ◽

Dynamic Time Warping ◽

Gaussian Mixture ◽

Time Warping ◽

Dynamic Time

Download Full-text

Concrete Crack Detection Based on Well-Known Feature Extractor Model and the YOLO_v2 Network

Applied Sciences ◽

10.3390/app11020813 ◽

2021 ◽

Vol 11 (2) ◽

pp. 813

Author(s):

Shuai Teng ◽

Zongchao Liu ◽

Gongfa Chen ◽

Li Cheng

Keyword(s):

Feature Extraction ◽

Crack Detection ◽

Computational Cost ◽

Concrete Structures ◽

Detection Algorithm ◽

Computational Time ◽

Image Size ◽

Important Indicator ◽

Feature Extractor ◽

Model Training

This paper compares the crack detection performance (in terms of precision and computational cost) of the YOLO_v2 using 11 feature extractors, which provides a base for realizing fast and accurate crack detection on concrete structures. Cracks on concrete structures are an important indicator for assessing their durability and safety, and real-time crack detection is an essential task in structural maintenance. The object detection algorithm, especially the YOLO series network, has significant potential in crack detection, while the feature extractor is the most important component of the YOLO_v2. Hence, this paper employs 11 well-known CNN models as the feature extractor of the YOLO_v2 for crack detection. The results confirm that a different feature extractor model of the YOLO_v2 network leads to a different detection result, among which the AP value is 0.89, 0, and 0 for ‘resnet18’, ‘alexnet’, and ‘vgg16’, respectively meanwhile, the ‘googlenet’ (AP = 0.84) and ‘mobilenetv2’ (AP = 0.87) also demonstrate comparable AP values. In terms of computing speed, the ‘alexnet’ takes the least computational time, the ‘squeezenet’ and ‘resnet18’ are ranked second and third respectively; therefore, the ‘resnet18’ is the best feature extractor model in terms of precision and computational cost. Additionally, through the parametric study (influence on detection results of the training epoch, feature extraction layer, and testing image size), the associated parameters indeed have an impact on the detection results. It is demonstrated that: excellent crack detection results can be achieved by the YOLO_v2 detector, in which an appropriate feature extractor model, training epoch, feature extraction layer, and testing image size play an important role.

Download Full-text

Duration weighted Gaussian Mixture Model supervector modeling for robust speaker recognition

2013 Ninth International Conference on Natural Computation (ICNC) ◽

10.1109/icnc.2013.6817977 ◽

2013 ◽

Author(s):

Zhe Ji ◽

Wei Hou ◽

Xin Jin ◽

Zhi-Yi Li

Keyword(s):

Gaussian Mixture Model ◽

Mixture Model ◽

Speaker Recognition ◽

Gaussian Mixture ◽

Robust Speaker Recognition

Download Full-text

Deep CNN Based Feature Extractor for Text-Prompted Speaker Recognition

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽

10.1109/icassp.2018.8462358 ◽

2018 ◽

Cited By ~ 3

Author(s):

Sergey Novoselov ◽

Oleg Kudashev ◽

Vadim Shchemelinin ◽

Ivan Kremnev ◽

GaZina Lavrentyeva

Keyword(s):

Speaker Recognition ◽

Feature Extractor ◽

Deep Cnn

Download Full-text

Comparison of feature extraction and normalization methods for speaker recognition using grid-audiovisual database

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v18.i2.pp782-789 ◽

2020 ◽

Vol 18 (2) ◽

pp. 782

Author(s):

Musab T. S. Al-Kaltakchi ◽

Haithem Abd Al-Raheem Taha ◽

Mohanad Abd Shehab ◽

Mohamed A.M. Abdullah

Keyword(s):

Feature Extraction ◽

Speaker Recognition ◽

Speaker Identification ◽

Gaussian Mixture ◽

Identification Accuracy ◽

Identification System ◽

Good Representation ◽

Mel Frequency Cepstral Coefficients ◽

Normalization Methods ◽

Cepstral Coefficients

<p><span lang="EN-GB">In this paper, different feature extraction and feature normalization methods are investigated for speaker recognition. With a view to give a good representation of acoustic speech signals, Power Normalized Cepstral Coefficients (PNCCs) and Mel Frequency Cepstral Coefficients (MFCCs) are employed for feature extraction. Then, to mitigate the effect of linear channel, Cepstral Mean-Variance Normalization (CMVN) and feature warping are utilized. The current paper investigates Text-independent speaker identification system by using 16 coefficients from both the MFCCs and PNCCs features. Eight different speakers are selected from the GRID-Audiovisual database with two females and six males. The speakers are modeled using the coupling between the Universal Background Model and Gaussian Mixture Models (GMM-UBM) in order to get a fast scoring technique and better performance. The system shows 100% in terms of speaker identification accuracy. The results illustrated that PNCCs features have better performance compared to the MFCCs features to identify females compared to male speakers. Furthermore, feature wrapping reported better performance compared to the CMVN method. </span></p>

Download Full-text