U-Vectors: Generating Clusterable Speaker Embedding from Unlabeled Data

2021 ◽  
Vol 11 (21) ◽  
pp. 10079
Author(s):  
Muhammad Firoz Mridha ◽  
Abu Quwsar Ohi ◽  
Muhammad Mostafa Monowar ◽  
Md. Abdul Hamid ◽  
Md. Rashedul Islam ◽  
...  

Speaker recognition deals with recognizing speakers by their speech. Most speaker recognition systems are built upon two stages: the first stage extracts low-dimensional correlation embeddings from speech, and the second performs the classification task. The robustness of a speaker recognition system mainly depends on the extraction process of the speech embeddings, which are primarily pre-trained on a large-scale dataset. Because the embedding systems are pre-trained, the performance of speaker recognition models greatly depends on the domain adaptation policy, and it may degrade if the model is trained using inadequate data. This paper introduces a speaker recognition strategy for unlabeled data, which generates clusterable embedding vectors from small fixed-size speech frames. The unsupervised training strategy involves the assumption that a small speech segment should contain a single speaker. Based on this assumption, a pairwise constraint is constructed with noise augmentation policies and used to train the AutoEmbedder architecture that generates speaker embeddings. Without relying on a domain adaptation policy, the process produces clusterable speaker embeddings in an unsupervised manner, termed unsupervised vectors (u-vectors). The evaluation is conducted on two popular English-language speaker recognition datasets, TIMIT and LibriSpeech. A Bengali dataset is also included to illustrate the diversity of domain shifts for speaker recognition systems. Finally, we conclude that the proposed approach achieves satisfactory performance using pairwise architectures.
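The pairwise-constraint construction described above can be sketched as follows. This is a minimal illustration under the paper's single-speaker-per-segment assumption; the function name, frame shapes, and sampling scheme are ours, not the paper's:

```python
import numpy as np

def make_pairs(segments, rng):
    """Build (frame_a, frame_b, label) pairs under the assumption that
    every short speech segment contains exactly one speaker."""
    pairs = []
    for i, seg in enumerate(segments):
        # positive pair: two distinct frames from the same segment
        a, b = rng.choice(len(seg), size=2, replace=False)
        pairs.append((seg[a], seg[b], 1))
        # negative pair: a frame from this segment vs. one from another segment
        j = (i + rng.integers(1, len(segments))) % len(segments)
        other = segments[j]
        pairs.append((seg[a], other[rng.integers(len(other))], 0))
    return pairs

rng = np.random.default_rng(0)
# 4 toy segments, each with 10 frames of 40-dimensional features
segments = [rng.normal(size=(10, 40)) for _ in range(4)]
pairs = make_pairs(segments, rng)
```

In the paper's pipeline, such labeled pairs (further augmented with noise) would train the embedding network; here they are simply returned for inspection.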

Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-9
Author(s):  
Jiang Lin ◽  
Yi Yumei ◽  
Zhang Maosheng ◽  
Chen Defeng ◽  
Wang Chao ◽  
...  

In speaker recognition systems, feature extraction is a challenging task under environmental noise conditions. To improve the robustness of the features, we propose a multiscale chaotic feature for speaker recognition. We use a multiresolution analysis technique to capture finer information about different speakers in the frequency domain. Then, we extract the speech's chaotic characteristics based on a nonlinear dynamic model, which helps to improve the discriminative power of the features. Finally, we use a GMM-UBM model to build a speaker recognition system. Our experimental results verify its good performance: under clean-speech and noisy-speech conditions, the EER of our method is reduced by 13.94% and 26.5%, respectively, compared with the state-of-the-art method.


2017 ◽  
Vol 9 (3) ◽  
pp. 53 ◽  
Author(s):  
Pardeep Sangwan ◽  
Saurabh Bhardwaj

Speaker recognition systems are classified according to their database, feature extraction techniques, and classification methods. There is a clear need to work on every dimension of forensic speaker recognition systems, from the initial database-collection phase through to the recognition phase. The present work provides a structured approach towards building a robust speech database for an efficient speaker recognition system. The databases required for biometric and forensic systems are entirely different: databases for biometric systems are readily available, while databases for forensic speaker recognition are scarce. The paper also surveys several databases available for speaker recognition systems.


AVITEC ◽  
2019 ◽  
Vol 1 (1) ◽  
Author(s):  
Noor Fita Indri Prayoga

Voice is one way to communicate and express yourself. Speaker recognition is a process carried out by a device to recognize the speaker through the voice. This study designed a speaker recognition system able to identify speakers based on what they said, using the dynamic time warping (DTW) method, implemented in MATLAB. The design begins with preparing reference data and test data. Both follow the same process, starting with sound recording, preprocessing, and feature extraction. In this system, the Fast Fourier Transform (FFT) method is used to extract the features. The feature-extraction results from the two data sets are then compared using the DTW method; the reference whose DTW calculation yields the smallest value is selected as the output. The test results show that the system can identify a voice with a best recognition accuracy of 90% and an average recognition accuracy of 80%. These results were obtained from 50 tests carried out by 5 people (3 men and 2 women), with each speaker saying a predetermined word.
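The template-matching step described above can be sketched with the classic DTW recurrence; this is a generic Python illustration (the toy sine/cosine "features" stand in for the FFT features), not the study's MATLAB code:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between feature sequences a and b,
    each of shape (length, n_features)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy templates; the test sequence is a time-stretched version of "alice",
# which DTW's warping is designed to tolerate.
refs = {"alice": np.sin(np.linspace(0, 3, 30))[:, None],
        "bob": np.cos(np.linspace(0, 3, 40))[:, None]}
test = np.sin(np.linspace(0, 3, 35))[:, None]
best = min(refs, key=lambda name: dtw_distance(test, refs[name]))
```

As in the system above, the reference with the smallest DTW distance is chosen as the recognized speaker.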


2021 ◽  
Author(s):  
Lin Li ◽  
Fuchuan Tong ◽  
Qingyang Hong

A typical speaker recognition system often involves two modules: a feature extractor front-end and a speaker identity back-end. Despite the superior performance that deep neural networks have achieved for the front-end, their success benefits from the availability of large-scale, correctly labeled datasets. Label noise is unavoidable in speaker recognition datasets, and it affects both the front-end and the back-end, degrading speaker recognition performance. In this paper, we first conduct comprehensive experiments to improve the understanding of the effects of label noise on both the front-end and back-end. Then, we propose a simple yet effective training paradigm and loss correction method to handle label noise for the front-end. We combine our proposed method with the recently proposed Bayesian estimation of PLDA for noisy labels, and the whole system shows strong robustness to label noise. Furthermore, we show two practical applications of the improved system: one corrects noisy labels based on an utterance's chunk-level predictions, and the other algorithmically filters out high-confidence noisy samples within a dataset. By applying the second application to the NIST SRE04-10 dataset and verifying the filtered utterances by human validation, we identify that approximately 1% of the SRE04-10 dataset is made up of label errors.
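The chunk-level filtering idea can be sketched as follows; this is our minimal illustration of flagging confidently mislabeled utterances, not the authors' actual algorithm, and the function name, threshold, and toy posteriors are assumptions:

```python
import numpy as np

def flag_noisy_labels(chunk_probs, labels, threshold=0.9):
    """Flag utterances whose averaged chunk-level speaker posteriors
    confidently disagree with the assigned dataset label."""
    flags = []
    for probs, lab in zip(chunk_probs, labels):
        pred = probs.mean(axis=0)            # average over the utterance's chunks
        top = int(pred.argmax())
        # confident disagreement with the label => likely label error
        flags.append(bool(top != lab and pred[top] >= threshold))
    return flags

# Two toy utterances over 3 speakers, both labeled as speaker 0.
good = np.tile([0.97, 0.02, 0.01], (5, 1))   # chunks agree with the label
bad = np.tile([0.02, 0.96, 0.02], (5, 1))    # chunks point to speaker 1 instead
flags = flag_noisy_labels([good, bad], [0, 0])
```

Flagged utterances could then be removed or sent for human validation, as done for SRE04-10 above.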


Terminology ◽  
2010 ◽  
Vol 16 (2) ◽  
pp. 141-158 ◽  
Author(s):  
Spela Vintar

The paper describes LUIZ, a bilingual term recognition system that has been developed for the Slovene-English language pair. The system is a hybrid term extractor using morphosyntactic patterns and statistical ranking to propose domain-specific expressions for each of the two languages, whereupon translation equivalents between the languages are identified using the innovative bag-of-equivalents approach. This simple but effective method is based on the Twente word aligner to obtain a lexicon of single-word translation pairs and their probability scores, which is then used to identify correspondences between multi-word terms. The bilingual term recognition system has been tested and evaluated on three parallel subcorpora from the tourism, accounting, and military domains. The average precision of the term alignment component is 0.83, whereby only fully equivalent and domain-relevant terms were counted as positives. Another advantage of the described approach is that it successfully detects term variants and multiple translations of a candidate multi-word term. Since our term alignment method does not require sentence-aligned corpora, it can be used with comparable corpora, provided we already have a domain-specific lexicon or dictionary of single-word correspondences. The paper concludes with some thoughts on the users of term recognition systems and their needs, based on our observations from the online version of the system.
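The bag-of-equivalents idea can be sketched as follows; this is a deliberately simplified illustration (the scoring rule, toy lexicon, and example terms are ours, not LUIZ's actual implementation, which also uses alignment probabilities):

```python
def bag_equivalence(src_term, tgt_term, lexicon):
    """Score a candidate term pair as the fraction of source-term words
    that have a known translation among the target-term words."""
    src = src_term.lower().split()
    tgt = set(tgt_term.lower().split())
    hits = sum(1 for w in src if lexicon.get(w, set()) & tgt)
    return hits / len(src)

# Toy single-word translation lexicon (illustrative entries only).
lexicon = {"davek": {"tax"}, "dodana": {"added"}, "vrednost": {"value"}}
score = bag_equivalence("davek dodana vrednost", "value added tax", lexicon)
```

Because the score only checks word bags, it tolerates reordering between languages, which is why multi-word terms can be aligned without sentence-aligned corpora.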


2020 ◽  
Author(s):  
Karthika Kuppusamy ◽  
Chandra Eswaran

Abstract With the growth of conversational voice recognition systems such as Alexa, Siri, and OK Google, natural-language conversational systems, including chatbots and voice recognition systems, are at a new high, and determining the age of a speaker is critical for setting the pertinent context. Age can be inferred from the speech signal by analyzing various factors such as physical attributes of the voice, linguistic attributes, frequency, and speech rate. This article discusses extracting spectral features of speech, such as cepstral coefficients, spectral decrease, centroid, flatness, spectral entropy, F0DIFF, jitter, and shimmer, as inputs. These features help in classifying speaker age through deep learning techniques. A novel approach is presented, along with a model implemented using Deep Neural Networks and Convolutional Neural Networks, classifying the features with three different classifiers: Gaussian Mixture Model (GMM), Support Vector Machine (SVM), and GMM-SVM. The results obtained from the proposed system outline its performance in speaker age recognition.
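Two of the spectral features listed above can be sketched directly from an FFT magnitude spectrum; this is a generic textbook-style computation (our function name and toy signals), not the authors' feature extractor:

```python
import numpy as np

def spectral_features(frame, sr):
    """Spectral centroid (Hz) and spectral flatness of one speech frame."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # centroid: magnitude-weighted mean frequency
    centroid = np.sum(freqs * mag) / np.sum(mag)
    power = mag**2 + 1e-12
    # flatness: geometric mean over arithmetic mean of the power spectrum
    flatness = np.exp(np.mean(np.log(power))) / np.mean(power)
    return centroid, flatness

sr = 16000
t = np.arange(1024) / sr
tone = np.sin(2 * np.pi * 440 * t)                   # peaked spectrum, low flatness
noise = np.random.default_rng(0).normal(size=1024)   # broad spectrum, high flatness
```

A pure tone yields a low centroid and near-zero flatness, while white noise yields a high centroid and flatness near one; such contrasts are what make these features discriminative inputs for a classifier.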


Author(s):  
Abimbola Fisusi ◽  
Thomas Yesufu

Abstract This paper gives an overview of speaker recognition systems. Speaker recognition is the task of automatically recognizing who is speaking by identifying an unknown speaker among several reference speakers, using speaker-specific information contained in speech waves. The different classifications of speaker recognition and the speech processing techniques required for the recognition task are discussed. The basic modules of a speaker recognition system are outlined and discussed. Some of the techniques required to implement each module are discussed in detail and others are mentioned, and the methods are compared with one another. Finally, the paper concludes with a few research trends in speaker recognition for some years to come.


Author(s):  
KAWTHAR YASMINE ZERGAT ◽  
ABDERRAHMANE AMROUCHE

A major concern for current research on automatic speaker recognition is the effectiveness of speaker modeling techniques, because talkers have their own speaking styles that depend on their specific accents and dialects. This paper investigates the influence of dialect and database size on the text-independent speaker verification task using SVM and hybrid GMM/SVM speaker modeling. The Principal Component Analysis (PCA) technique is used in the front-end of the speaker recognition system to extract the most representative features. Experimental results show that database size has an important impact on SVM- and GMM/SVM-based speaker verification performance, while dialect has no significant effect. Applying PCA dimensionality reduction improves recognition accuracy for both SVM- and GMM/SVM-based systems; however, it does not yield a clear observation about the dialect effect.
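The PCA front-end described above can be sketched with plain numpy; this is a generic PCA projection under our own toy data and names, not the authors' configuration:

```python
import numpy as np

def pca_fit(X, n_components):
    """Fit PCA on the feature rows of X; return the mean and top components."""
    mean = X.mean(axis=0)
    # rows of Vt are principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    return (X - mean) @ components.T

# Toy anisotropic "speaker features": most variance lives in two directions,
# so dropping the third dimension loses little information.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.diag([5.0, 1.0, 0.1])
mean, comps = pca_fit(X, 2)
Z = pca_transform(X, mean, comps)   # reduced features fed to the SVM back-end
```

Keeping only the highest-variance directions is what "extracting the most representative features" amounts to in this front-end.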


2013 ◽  
Vol 416-417 ◽  
pp. 1331-1335 ◽  
Author(s):  
Bang Jun Cui

Language is the most natural, effective, and convenient way to convey information. At the same time, because people differ in their vocal organs and speaking habits, identifying a person by their voice makes the identification process more convenient and effective. Such applications can not only further safeguard information security but also bring great convenience to users and increase the economic benefits for system operators. This paper first introduces the principles and research background of speaker recognition systems, then discusses the preprocessing of feature parameters and the feature extraction process, and finally presents the implementation results and testing of the speaker recognition system, showing that the system runs correctly, performs well, and achieves the desired effect.

