U-Vectors: Generating Clusterable Speaker Embedding from Unlabeled Data

2021 ◽  
Vol 11 (21) ◽  
pp. 10079
Author(s):  
Muhammad Firoz Mridha ◽  
Abu Quwsar Ohi ◽  
Muhammad Mostafa Monowar ◽  
Md. Abdul Hamid ◽  
Md. Rashedul Islam ◽  
...  

Speaker recognition deals with recognizing speakers by their speech. Most speaker recognition systems are built upon two stages: the first stage extracts low-dimensional correlation embeddings from speech, and the second performs the classification task. The robustness of a speaker recognition system mainly depends on the extraction process of the speech embeddings, which are primarily pre-trained on a large-scale dataset. Because the embedding systems are pre-trained, the performance of speaker recognition models greatly depends on the domain adaptation policy, and it may degrade if the model is trained using inadequate data. This paper introduces a speaker recognition strategy for unlabeled data, which generates clusterable embedding vectors from small fixed-size speech frames. The unsupervised training strategy involves the assumption that a small speech segment should contain a single speaker. Based on this assumption, a pairwise constraint is constructed with noise augmentation policies and used to train the AutoEmbedder architecture that generates speaker embeddings. Without relying on a domain adaptation policy, the process produces clusterable speaker embeddings in an unsupervised manner, termed unsupervised vectors (u-vectors). The evaluation is conducted on two popular English-language speaker recognition datasets, TIMIT and LibriSpeech. A Bengali dataset is also included to illustrate the diversity of domain shifts for speaker recognition systems. Finally, we conclude that the proposed approach achieves satisfactory performance using pairwise architectures.
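The pairwise-constraint construction described above can be sketched as follows. This is a minimal illustration under the paper's single-speaker-per-segment assumption; the function name, frame shapes, and sampling scheme are ours, not the paper's:

```python
import numpy as np

def make_pairs(segments, rng):
    """Build (frame_a, frame_b, label) pairs under the assumption that
    every short speech segment contains exactly one speaker."""
    pairs = []
    for i, seg in enumerate(segments):
        # positive pair: two distinct frames from the same segment
        a, b = rng.choice(len(seg), size=2, replace=False)
        pairs.append((seg[a], seg[b], 1))
        # negative pair: a frame from this segment vs. one from another segment
        j = (i + rng.integers(1, len(segments))) % len(segments)
        other = segments[j]
        pairs.append((seg[a], other[rng.integers(len(other))], 0))
    return pairs

rng = np.random.default_rng(0)
# 4 toy segments, each with 10 frames of 40-dimensional features
segments = [rng.normal(size=(10, 40)) for _ in range(4)]
pairs = make_pairs(segments, rng)
```

In the paper's pipeline, such labeled pairs (further augmented with noise) would train the embedding network; here they are simply returned for inspection.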

Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-9
Author(s):  
Jiang Lin ◽  
Yi Yumei ◽  
Zhang Maosheng ◽  
Chen Defeng ◽  
Wang Chao ◽  
...  

In speaker recognition systems, feature extraction is a challenging task under environmental noise conditions. To improve the robustness of the features, we propose a multiscale chaotic feature for speaker recognition. We use a multiresolution analysis technique to capture finer information about different speakers in the frequency domain. Then, we extract the speech's chaotic characteristics based on a nonlinear dynamic model, which helps to improve the discriminative power of the features. Finally, we use a GMM-UBM model to build a speaker recognition system. Our experimental results verify its good performance: under clean-speech and noisy-speech conditions, the EER of our method is reduced by 13.94% and 26.5%, respectively, compared with the state-of-the-art method.


2017 ◽  
Vol 9 (3) ◽  
pp. 53 ◽  
Author(s):  
Pardeep Sangwan ◽  
Saurabh Bhardwaj

Speaker recognition systems are classified according to their database, feature extraction techniques, and classification methods. There is a clear need to work on every dimension of forensic speaker recognition systems, from the initial database-collection phase through to the recognition phase. The present work provides a structured approach towards building a robust speech database for an efficient speaker recognition system. The databases required for biometric and forensic systems are entirely different: databases for biometric systems are readily available, while databases for forensic speaker recognition are scarce. The paper also surveys several databases available for speaker recognition systems.


AVITEC ◽  
2019 ◽  
Vol 1 (1) ◽  
Author(s):  
Noor Fita Indri Prayoga

Voice is one way to communicate and express yourself. Speaker recognition is a process carried out by a device to recognize the speaker through the voice. This study designed a speaker recognition system able to identify speakers based on what they said, using the dynamic time warping (DTW) method, implemented in MATLAB. The design begins with preparing reference data and test data. Both follow the same process, starting with sound recording, preprocessing, and feature extraction. In this system, the Fast Fourier Transform (FFT) method is used to extract the features. The feature-extraction results from the two data sets are then compared using the DTW method; the reference whose DTW calculation yields the smallest value is selected as the output. The test results show that the system can identify a voice with a best recognition accuracy of 90% and an average recognition accuracy of 80%. These results were obtained from 50 tests carried out by 5 people (3 men and 2 women), with each speaker saying a predetermined word.
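The template-matching step described above can be sketched with the classic DTW recurrence; this is a generic Python illustration (the toy sine/cosine "features" stand in for the FFT features), not the study's MATLAB code:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between feature sequences a and b,
    each of shape (length, n_features)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy templates; the test sequence is a time-stretched version of "alice",
# which DTW's warping is designed to tolerate.
refs = {"alice": np.sin(np.linspace(0, 3, 30))[:, None],
        "bob": np.cos(np.linspace(0, 3, 40))[:, None]}
test = np.sin(np.linspace(0, 3, 35))[:, None]
best = min(refs, key=lambda name: dtw_distance(test, refs[name]))
```

As in the system above, the reference with the smallest DTW distance is chosen as the recognized speaker.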


2021 ◽  
Author(s):  
Lin Li ◽  
Fuchuan Tong ◽  
Qingyang Hong

A typical speaker recognition system often involves two modules: a feature extractor front-end and a speaker identity back-end. Despite the superior performance that deep neural networks have achieved for the front-end, their success benefits from the availability of large-scale, correctly labeled datasets. Label noise is unavoidable in speaker recognition datasets, and it affects both the front-end and the back-end, degrading speaker recognition performance. In this paper, we first conduct comprehensive experiments to improve the understanding of the effects of label noise on both the front-end and back-end. Then, we propose a simple yet effective training paradigm and loss correction method to handle label noise for the front-end. We combine our proposed method with the recently proposed Bayesian estimation of PLDA for noisy labels, and the whole system shows strong robustness to label noise. Furthermore, we show two practical applications of the improved system: one corrects noisy labels based on an utterance's chunk-level predictions, and the other algorithmically filters out high-confidence noisy samples within a dataset. By applying the second application to the NIST SRE04-10 dataset and verifying the filtered utterances by human validation, we identify that approximately 1% of the SRE04-10 dataset is made up of label errors.
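The chunk-level filtering idea can be sketched as follows; this is our minimal illustration of flagging confidently mislabeled utterances, not the authors' actual algorithm, and the function name, threshold, and toy posteriors are assumptions:

```python
import numpy as np

def flag_noisy_labels(chunk_probs, labels, threshold=0.9):
    """Flag utterances whose averaged chunk-level speaker posteriors
    confidently disagree with the assigned dataset label."""
    flags = []
    for probs, lab in zip(chunk_probs, labels):
        pred = probs.mean(axis=0)            # average over the utterance's chunks
        top = int(pred.argmax())
        # confident disagreement with the label => likely label error
        flags.append(bool(top != lab and pred[top] >= threshold))
    return flags

# Two toy utterances over 3 speakers, both labeled as speaker 0.
good = np.tile([0.97, 0.02, 0.01], (5, 1))   # chunks agree with the label
bad = np.tile([0.02, 0.96, 0.02], (5, 1))    # chunks point to speaker 1 instead
flags = flag_noisy_labels([good, bad], [0, 0])
```

Flagged utterances could then be removed or sent for human validation, as done for SRE04-10 above.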


Terminology ◽  
2010 ◽  
Vol 16 (2) ◽  
pp. 141-158 ◽  
Author(s):  
Spela Vintar

The paper describes LUIZ, a bilingual term recognition system that has been developed for the Slovene-English language pair. The system is a hybrid term extractor using morphosyntactic patterns and statistical ranking to propose domain-specific expressions for each of the two languages, whereupon translation equivalents between the languages are identified using the innovative bag-of-equivalents approach. This simple but effective method is based on the Twente word aligner to obtain a lexicon of single-word translation pairs and their probability scores, which is then used to identify correspondences between multi-word terms. The bilingual term recognition system has been tested and evaluated on three parallel subcorpora from the tourism, accounting, and military domains. The average precision of the term alignment component is 0.83, whereby only fully equivalent and domain-relevant terms were counted as positives. Another advantage of the described approach is that it successfully detects term variants and multiple translations of a candidate multi-word term. Since our term alignment method does not require sentence-aligned corpora, it can be used with comparable corpora, provided we already have a domain-specific lexicon or dictionary of single-word correspondences. The paper concludes with some thoughts on the users of term recognition systems and their needs, based on our observations from the online version of the system.
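The bag-of-equivalents idea can be sketched as follows; this is a deliberately simplified illustration (the scoring rule, toy lexicon, and example terms are ours, not LUIZ's actual implementation, which also uses alignment probabilities):

```python
def bag_equivalence(src_term, tgt_term, lexicon):
    """Score a candidate term pair as the fraction of source-term words
    that have a known translation among the target-term words."""
    src = src_term.lower().split()
    tgt = set(tgt_term.lower().split())
    hits = sum(1 for w in src if lexicon.get(w, set()) & tgt)
    return hits / len(src)

# Toy single-word translation lexicon (illustrative entries only).
lexicon = {"davek": {"tax"}, "dodana": {"added"}, "vrednost": {"value"}}
score = bag_equivalence("davek dodana vrednost", "value added tax", lexicon)
```

Because the score only checks word bags, it tolerates reordering between languages, which is why multi-word terms can be aligned without sentence-aligned corpora.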


2020 ◽  
Author(s):  
Karthika Kuppusamy ◽  
Chandra Eswaran

Abstract With the growth of conversational voice recognition systems such as Alexa, Siri, and OK Google, natural-language conversational systems, including chatbots and voice recognition systems, are at a new high, and determining the age of a speaker is critical for setting the pertinent context. Age can be inferred from the speech signal by analyzing various factors such as physical attributes of the voice, linguistic attributes, frequency, and speech rate. This article discusses extracting spectral features of speech, such as cepstral coefficients, spectral decrease, centroid, flatness, spectral entropy, F0DIFF, jitter, and shimmer, as inputs. These features help in classifying speaker age through deep learning techniques. A novel approach is presented, along with a model implemented using Deep Neural Networks and Convolutional Neural Networks, classifying the features with three different classifiers: Gaussian Mixture Model (GMM), Support Vector Machine (SVM), and GMM-SVM. The results obtained from the proposed system outline its performance in speaker age recognition.
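Two of the spectral features listed above can be sketched directly from an FFT magnitude spectrum; this is a generic textbook-style computation (our function name and toy signals), not the authors' feature extractor:

```python
import numpy as np

def spectral_features(frame, sr):
    """Spectral centroid (Hz) and spectral flatness of one speech frame."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # centroid: magnitude-weighted mean frequency
    centroid = np.sum(freqs * mag) / np.sum(mag)
    power = mag**2 + 1e-12
    # flatness: geometric mean over arithmetic mean of the power spectrum
    flatness = np.exp(np.mean(np.log(power))) / np.mean(power)
    return centroid, flatness

sr = 16000
t = np.arange(1024) / sr
tone = np.sin(2 * np.pi * 440 * t)                   # peaked spectrum, low flatness
noise = np.random.default_rng(0).normal(size=1024)   # broad spectrum, high flatness
```

A pure tone yields a low centroid and near-zero flatness, while white noise yields a high centroid and flatness near one; such contrasts are what make these features discriminative inputs for a classifier.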


Author(s):  
Abimbola Fisusi ◽  
Thomas Yesufu

Abstract This paper gives an overview of speaker recognition systems. Speaker recognition is the task of automatically recognizing who is speaking by identifying an unknown speaker among several reference speakers, using speaker-specific information contained in speech waves. The different classifications of speaker recognition and the speech processing techniques required for the recognition task are discussed. The basic modules of a speaker recognition system are outlined and discussed. Some of the techniques required to implement each module are discussed in detail and others are mentioned, and the methods are compared with one another. Finally, the paper concludes with a few research trends in speaker recognition for some years to come.


Author(s):  
KAWTHAR YASMINE ZERGAT ◽  
ABDERRAHMANE AMROUCHE

A major concern for current research on automatic speaker recognition is the effectiveness of speaker modeling techniques, because talkers have their own speaking styles that depend on their specific accents and dialects. This paper investigates the influence of dialect and database size on the text-independent speaker verification task using SVM and hybrid GMM/SVM speaker modeling. The Principal Component Analysis (PCA) technique is used in the front-end of the speaker recognition system to extract the most representative features. Experimental results show that database size has an important impact on SVM- and GMM/SVM-based speaker verification performance, while dialect has no significant effect. Applying PCA dimensionality reduction improves recognition accuracy for both SVM- and GMM/SVM-based systems; however, it does not yield a clear observation about the dialect effect.
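The PCA front-end described above can be sketched with plain numpy; this is a generic PCA projection under our own toy data and names, not the authors' configuration:

```python
import numpy as np

def pca_fit(X, n_components):
    """Fit PCA on the feature rows of X; return the mean and top components."""
    mean = X.mean(axis=0)
    # rows of Vt are principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    return (X - mean) @ components.T

# Toy anisotropic "speaker features": most variance lives in two directions,
# so dropping the third dimension loses little information.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.diag([5.0, 1.0, 0.1])
mean, comps = pca_fit(X, 2)
Z = pca_transform(X, mean, comps)   # reduced features fed to the SVM back-end
```

Keeping only the highest-variance directions is what "extracting the most representative features" amounts to in this front-end.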


2013 ◽  
Vol 416-417 ◽  
pp. 1331-1335 ◽  
Author(s):  
Bang Jun Cui

Language is the most natural, effective, and convenient way to convey information. At the same time, because people differ in their vocal organs and speaking habits, identifying a person by their voice makes the identification process more convenient and effective. Such applications can not only further safeguard information security but also bring great convenience to users and increase the economic benefits for system operators. This paper first introduces the principles and research background of speaker recognition systems, then discusses the preprocessing of feature parameters and the feature extraction process, and finally presents the implementation results and testing of the speaker recognition system, showing that the system runs correctly, performs well, and achieves the desired effect.

