Learning to Fool the Speaker Recognition

Author(s):  
Jiguo Li ◽  
Xinfeng Zhang ◽  
Jizheng Xu ◽  
Siwei Ma ◽  
Wen Gao

Due to the widespread deployment of fingerprint/face/speaker recognition systems, the risks in these systems, especially adversarial attacks, have drawn increasing attention in recent years. Previous research mainly studied adversarial attacks on vision-based systems, such as fingerprint and face recognition, while attacks on speech-based systems have not been well studied, even though such systems are widely used in daily life. In this article, we attempt to fool a state-of-the-art speaker recognition model and present the speaker recognition attacker, a lightweight multi-layer convolutional neural network that fools a well-trained state-of-the-art speaker recognition model by adding imperceptible perturbations to the raw speech waveform. We find that the speaker recognition system is vulnerable to adversarial attacks, and we achieve a high success rate on both non-targeted and targeted attacks. In addition, we present an effective method that leverages a pretrained phoneme recognition model to optimize the attacker, obtaining a tradeoff between attack success rate and perceptual quality. Experimental results on the TIMIT and LibriSpeech datasets demonstrate the effectiveness and efficiency of our proposed model. The frequency-analysis experiments indicate that high-frequency attacks are more effective than low-frequency attacks, which differs from the conclusion drawn in previous image-based works. Additionally, an ablation study gives further insight into our model.
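The attack setting can be illustrated with a minimal, hedged sketch (this is not the authors' CNN attacker): an FGSM-style perturbation against a toy linear speaker scorer, showing an amplitude-bounded, hence near-imperceptible, perturbation added directly to the raw waveform.

```python
import numpy as np

# Hedged sketch: FGSM-style non-targeted attack on a toy linear speaker
# scorer. All names (waveform, w_speaker, epsilon) are illustrative.
rng = np.random.default_rng(0)

waveform = rng.standard_normal(16000)        # 1 s of toy "speech" at 16 kHz
w_speaker = rng.standard_normal(16000)       # weights of the true speaker's scorer

def speaker_score(x, w):
    """Higher score = more confident the waveform belongs to this speaker."""
    return float(np.dot(x, w)) / len(x)

# Non-targeted attack: step against the score gradient (which, for a linear
# scorer, is just w_speaker), keeping the perturbation under a budget epsilon.
epsilon = 0.001
adversarial = waveform - epsilon * np.sign(w_speaker)

clean_score = speaker_score(waveform, w_speaker)
adv_score = speaker_score(adversarial, w_speaker)
```

The perturbation never exceeds epsilon in amplitude, yet it strictly lowers the speaker score, which is the core of the imperceptibility/success tradeoff the abstract describes.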

Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-9
Author(s):  
Jiang Lin ◽  
Yi Yumei ◽  
Zhang Maosheng ◽  
Chen Defeng ◽  
Wang Chao ◽  
...  

In speaker recognition systems, feature extraction is a challenging task under environmental noise. To improve feature robustness, we propose a multiscale chaotic feature for speaker recognition. We use a multiresolution analysis technique to capture finer information about different speakers in the frequency domain. We then extract the chaotic characteristics of the speech based on a nonlinear dynamic model, which helps to improve the discriminability of the features. Finally, we use a GMM-UBM model to build the speaker recognition system. Our experimental results verify its good performance: under clean-speech and noisy-speech conditions, the EER of our method is reduced by 13.94% and 26.5%, respectively, compared with the state-of-the-art method.
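The GMM-UBM scoring step can be sketched as follows (a hedged, minimal version; shapes and parameter values are illustrative, not the paper's): a test utterance is scored by the average log-likelihood ratio between a speaker-adapted GMM and the universal background model (UBM).

```python
import numpy as np

def diag_gmm_loglik(X, weights, means, variances):
    """Per-frame log-likelihood of X under a diagonal-covariance GMM."""
    diff = X[:, None, :] - means[None, :, :]                   # (frames, comps, dims)
    exponent = -0.5 * np.sum(diff ** 2 / variances, axis=2)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    log_comp = np.log(weights) + log_norm + exponent           # (frames, comps)
    m = log_comp.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))  # log-sum-exp

rng = np.random.default_rng(1)
dims, comps = 13, 4                       # e.g. 13 cepstral features, 4 mixtures
ubm_w = np.full(comps, 1.0 / comps)
ubm_mu = rng.standard_normal((comps, dims))
ubm_var = np.ones((comps, dims))
spk_mu = ubm_mu + 0.5                     # toy stand-in for MAP-adapted means

# Frames drawn from the speaker's own model should score higher under it.
which = rng.integers(0, comps, 200)
X = spk_mu[which] + rng.standard_normal((200, dims))
score = np.mean(diag_gmm_loglik(X, ubm_w, spk_mu, ubm_var)
                - diag_gmm_loglik(X, ubm_w, ubm_mu, ubm_var))
```

A positive score accepts the claimed speaker; in practice the threshold is tuned on held-out data.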


Author(s):  
Xiangpeng Li ◽  
Jingkuan Song ◽  
Lianli Gao ◽  
Xianglong Liu ◽  
Wenbing Huang ◽  
...  

Most recent progress on visual question answering is based on recurrent neural networks (RNNs) with attention. Despite their success, these models are often time-consuming and have difficulty modeling long-range dependencies due to the sequential nature of RNNs. We propose a new architecture, Positional Self-Attention with Co-attention (PSAC), which does not require RNNs for video question answering. Specifically, inspired by the success of self-attention in machine translation, we propose a positional self-attention that calculates the response at each position by attending to all positions within the same sequence and then adds representations of absolute positions. PSAC can therefore exploit the global dependencies of the question and the temporal information in the video, and it allows question and video encoding to be executed in parallel. Furthermore, in addition to attending to the video features relevant to the given questions (i.e., video attention), we utilize a co-attention mechanism that simultaneously models “what words to listen to” (question attention). To the best of our knowledge, this is the first work to replace RNNs with self-attention for the task of visual question answering. Experimental results on four tasks of the benchmark dataset show that our model significantly outperforms the state-of-the-art on three tasks and attains a comparable result on the Count task. Our model requires less computation time and achieves better performance than RNN-based methods. An additional ablation study demonstrates the effect of each component of our proposed model.
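The core mechanism can be sketched in a few lines (a hedged, single-head version, not the full PSAC block): scaled dot-product attention over a sequence of frame features with sinusoidal absolute-position encodings added, so every position attends to every other position in parallel, with no recurrence.

```python
import numpy as np

def positional_encoding(seq_len, dim):
    """Sinusoidal absolute-position encodings, as in the Transformer."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(X):
    """X: (seq_len, dim). Each output mixes information from all positions."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                   # pairwise affinities
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over key positions
    return weights @ X, weights

rng = np.random.default_rng(2)
frames = rng.standard_normal((8, 16))               # 8 video frames, 16-d features
out, attn = self_attention(frames + positional_encoding(8, 16))
```

Because the whole (seq, seq) score matrix is computed at once, all positions are processed in parallel, which is the source of the speedup over RNN encoders.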



Author(s):  
Buyu Li ◽  
Yu Liu ◽  
Xiaogang Wang

Despite the great success of two-stage detectors, the single-stage detector remains a more elegant and efficient approach, yet it suffers from two well-known disharmonies during training, i.e., the huge imbalance in quantity between positive and negative examples as well as between easy and hard examples. In this work, we first point out that the essential effect of the two disharmonies can be summarized in terms of the gradient. Further, we propose a novel gradient harmonizing mechanism (GHM) to address the disharmonies. The philosophy behind GHM can be easily embedded into both classification loss functions such as cross-entropy (CE) and regression loss functions such as smooth-L1 (SL1) loss. To this end, two novel loss functions, called GHM-C and GHM-R, are designed to balance the gradient flow for anchor classification and bounding box refinement, respectively. An ablation study on MS COCO demonstrates that, without laborious hyper-parameter tuning, both GHM-C and GHM-R can bring substantial improvements for single-stage detectors. Without bells and whistles, the proposed model achieves 41.6 mAP on the COCO test-dev set, surpassing the state-of-the-art method, Focal Loss (FL) + SL1, by 0.8. The code is released to facilitate future research.
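The gradient-density re-weighting idea behind GHM-C can be sketched as follows (a hedged simplification of the paper's binning scheme): for sigmoid cross-entropy the per-example gradient norm is g = |p - y|; examples are binned by g and weighted inversely to the bin's population, so both the huge mass of easy examples (g near 0) and rare outliers are down-weighted relative to their count.

```python
import numpy as np

def ghm_weights(p, y, bins=10):
    """Per-example loss weights from inverse gradient density (simplified)."""
    g = np.abs(p - y)                         # gradient norm in [0, 1]
    edges = np.linspace(0.0, 1.0, bins + 1)
    idx = np.clip(np.digitize(g, edges) - 1, 0, bins - 1)
    density = np.bincount(idx, minlength=bins).astype(float)   # examples per bin
    w = len(g) / (density[idx] * bins)        # inverse gradient density
    return w / w.mean()                       # keep the overall loss scale

rng = np.random.default_rng(3)
y = (rng.random(1000) < 0.05).astype(float)                 # mostly negatives
p = np.clip(y + 0.1 * rng.standard_normal(1000), 0.0, 1.0)  # mostly easy examples
w = ghm_weights(p, y)
```

Each example's cross-entropy term is then multiplied by its weight, which harmonizes the gradient contribution across easy and hard examples.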


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Raghad Tariq Al-Hassani ◽  
Dogu Cagdas Atilla ◽  
Çağatay Aydin

The speech signal is rich in features used for biometric recognition and for other applications such as gender and emotion recognition. Channel conditions, manifested as background noise and reverberation, are the main challenges, causing feature shifts between the test and training data. In this paper, a hybrid speaker identification model is built for consistent speech features and high recognition accuracy. Mel-frequency cepstral coefficient (MFCC) features are improved by incorporating a pitch-frequency coefficient from time-domain analysis of the speech. To enhance noise immunity, we propose a single-hidden-layer feed-forward neural network (FFNN) tuned by an optimized particle swarm optimization (OPSO) algorithm. The proposed model is tested using 10-fold cross-validation over different levels of additive white Gaussian noise (AWGN) (0-50 dB). A recognition accuracy of 97.83% is obtained from the proposed model in clean voice environments, and a noisy channel has less impact on the proposed model than on baseline classifiers such as the plain FFNN, random forest (RF), K-nearest neighbour (KNN), and support vector machine (SVM).
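The particle-swarm tuning loop can be sketched as follows (a hedged, plain-PSO version; the objective and dimensionality are illustrative stand-ins, not the paper's FFNN or its OPSO refinements): each particle is a candidate weight vector, pulled toward its personal best and the swarm's global best.

```python
import numpy as np

def pso(objective, dim, n_particles=30, iters=200, seed=4):
    """Minimize objective over R^dim with a basic particle swarm."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-5, 5, (n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()
    pbest_val = np.array([objective(p) for p in pos])
    gbest = pbest[np.argmin(pbest_val)].copy()
    w, c1, c2 = 0.7, 1.5, 1.5                # inertia and acceleration terms
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest, float(pbest_val.min())

# Stand-in objective: squared error of a weight vector from a known optimum.
target = np.array([1.0, -2.0, 0.5])
best, best_val = pso(lambda p: float(np.sum((p - target) ** 2)), dim=3)
```

In the paper's setting the objective would instead be the FFNN's classification error on training data, with each particle encoding the network's weights.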


Author(s):  
Halim Sayoud ◽  
Siham Ouamour

Most existing speaker recognition systems use state-of-the-art acoustic features. However, one can often recognize a speaker only by his or her prosodic features, especially the accent. For this reason, the authors investigate pertinent prosodic features that can be combined with classic acoustic features in order to improve recognition accuracy. The authors have developed a new prosodic model using a modified LVQ (Learning Vector Quantization) algorithm, called MLVQ (Modified LVQ). This model is composed of three reduced prosodic features: the mean of the pitch, the original duration, and the low-frequency energy. Since these features are heterogeneous, a new optimized metric, called the Optimized Distance for Heterogeneous Features (ODHEF), has been proposed. Speaker identification tests are conducted on an Arabic corpus because the NIST evaluations showed that speaker verification scores depend on the spoken language and that some of the worst scores were obtained for Arabic. Experimental results show good performance of the new prosodic approach.
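The plain LVQ1 update underlying an LVQ-based prosodic model can be sketched as follows (a hedged baseline; the authors' MLVQ modification and ODHEF metric are not reproduced here): each speaker owns a prototype in prosodic-feature space, and the prototype nearest a training sample moves toward it when the speaker label matches, away otherwise.

```python
import numpy as np

def lvq_train(X, labels, prototypes, proto_labels, lr=0.1, epochs=20):
    """Basic LVQ1: attract/repel the winning prototype per training sample."""
    protos = prototypes.copy()
    for _ in range(epochs):
        for x, y in zip(X, labels):
            j = int(np.argmin(np.linalg.norm(protos - x, axis=1)))  # winner
            sign = 1.0 if proto_labels[j] == y else -1.0
            protos[j] += sign * lr * (x - protos[j])
    return protos

rng = np.random.default_rng(5)
# Toy prosodic features: [mean pitch, duration, low-frequency energy]
spk_a = np.array([1.0, 0.0, 0.0]) + 0.2 * rng.standard_normal((50, 3))
spk_b = np.array([-1.0, 0.0, 0.0]) + 0.2 * rng.standard_normal((50, 3))
X = np.vstack([spk_a, spk_b])
labels = np.array([0] * 50 + [1] * 50)
init = np.array([[0.1, 0.0, 0.0], [-0.1, 0.0, 0.0]])
protos = lvq_train(X, labels, init, proto_labels=[0, 1])
```

After training, each prototype sits near its speaker's cluster; classification assigns a test sample the label of the nearest prototype, which is where a heterogeneous-feature metric like ODHEF would replace the Euclidean distance.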


2016 ◽  
Vol 2016 ◽  
pp. 1-11 ◽  
Author(s):  
Lei Lei ◽  
She Kun

An important application of speaker recognition is forensics. However, the accuracy of speaker recognition in forensic cases often drops off rapidly because of the ill effects of ambient noise, channel variability, differing durations of speech data, and so on. Finding a robust speaker recognition model is therefore very important for forensics. This paper builds a new speaker recognition model based on the wavelet cepstral coefficient (WCC), the i-vector, and cosine distance scoring (CDS). The model first uses the WCC to transform the speech into spectral feature vectors and then uses those vectors to train i-vectors that represent speech of different durations. CDS is used to compare the i-vectors to produce the evidence. Moreover, linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) are added to the CDS algorithm to deal with the channel variability problem. Finally, a likelihood ratio estimates the strength of the evidence. We use the TIMIT database to evaluate the performance of the proposed model. The experimental results show that the proposed model can effectively address the difficulties of the forensic scenario, although the time cost of the method is high.
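The CDS step itself is simple enough to sketch directly (a hedged illustration; in the full pipeline the LDA and WCCN projections would be applied to the i-vectors first, and the dimensionality here is illustrative): the verification score is the cosine of the angle between the enrolled and test i-vectors.

```python
import numpy as np

def cds_score(w_enroll, w_test):
    """Cosine distance scoring: cosine similarity between two i-vectors."""
    return float(np.dot(w_enroll, w_test)
                 / (np.linalg.norm(w_enroll) * np.linalg.norm(w_test)))

rng = np.random.default_rng(6)
enrolled = rng.standard_normal(400)                  # enrolled speaker's i-vector
same = enrolled + 0.3 * rng.standard_normal(400)     # same speaker, new utterance
other = rng.standard_normal(400)                     # a different speaker
```

Same-speaker pairs score near 1 while different-speaker pairs score near 0, and the score is compared against a calibrated threshold (or fed into the likelihood-ratio computation) to produce the evidence.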


Author(s):  
Anand Handa ◽  
Rashi Agarwal ◽  
Narendra Kohli

Due to highly variant face geometry and appearance, Facial Expression Recognition (FER) is still a challenging problem. Since CNNs can characterize 2-D signals, the authors propose, for emotion recognition in video, a feature selection model within the AlexNet architecture that extracts and filters facial features automatically. For emotion recognition in audio, the authors use a deep LSTM-RNN. Finally, they propose a probabilistic model for the fusion of the audio and visual models using the facial features and speech of a subject. The model combines all the extracted features and uses them to train linear SVM (Support Vector Machine) classifiers. The proposed model outperforms the other existing models and achieves state-of-the-art performance for the audio, visual, and fusion models. It classifies the seven known facial expressions, namely anger, happiness, surprise, fear, disgust, sadness, and neutral, on the eNTERFACE’05 dataset with an overall accuracy of 76.61%.


2019 ◽  
Vol 8 (2) ◽  
pp. 6429-6432

Speaker recognition is the task of identifying a speaker based on various features of his or her speech. It combines several mathematical operations, of which training and testing are the major parts. Feature extraction is the primary and most important step in any speaker recognition system, so the precision of the final result depends on the accuracy of the feature extraction technique. Much research has addressed feature extraction techniques such as MFCC and IMFCC; however, to increase the success rate of speaker recognition, the extracted features must be exact and accurate. In this paper we propose a modified feature extraction system.
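For reference, a standard MFCC pipeline (not the paper's modified system; parameter values are common textbook defaults) proceeds as pre-emphasis, framing, windowing, power spectrum, triangular mel filterbank, log, and DCT-II:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_ceps=13):
    """Standard MFCC features: (n_frames, n_ceps) from a 1-D waveform."""
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)                    # windowed frames
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft           # power spectrum
    # Triangular mel filterbank between 0 Hz and the Nyquist frequency
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # DCT-II over filterbank channels, keeping the first n_ceps coefficients
    k = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * k + 1) / (2 * n_filters))
    return log_energy @ basis.T

rng = np.random.default_rng(7)
feats = mfcc(rng.standard_normal(16000))           # 1 s of toy audio
```

A modified extraction scheme such as the one proposed would alter one of these stages (e.g., the filterbank shape or frequency warping) while keeping the overall frame-by-frame structure.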


2018 ◽  
Author(s):  
I Wayan Agus Surya Darma

Balinese character recognition is a technique for recognizing the features or patterns of Balinese characters; the features are generated through a feature extraction process. This research uses handwritten Balinese characters, for which the feature extraction process generates semantic and direction features. Recognition uses the K-Nearest Neighbor algorithm to recognize 81 handwritten Balinese characters, comparing the features of test character images with reference features. With K=3 and 10 references per character, the recognition system achieves a success rate of 97.53%.
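The K-Nearest Neighbor decision step with K=3 can be sketched as follows (feature values here are synthetic 2-D stand-ins for the semantic and direction features): a test feature vector takes the majority label among its three closest reference vectors.

```python
import numpy as np

def knn_predict(x, refs, ref_labels, k=3):
    """Majority vote among the k nearest reference feature vectors."""
    d = np.linalg.norm(refs - x, axis=1)     # Euclidean distance to each reference
    nearest = np.argsort(d)[:k]
    return int(np.argmax(np.bincount(ref_labels[nearest])))

rng = np.random.default_rng(8)
# 3 character classes, 10 reference feature vectors per class
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
refs = np.vstack([c + 0.3 * rng.standard_normal((10, 2)) for c in centers])
ref_labels = np.repeat(np.arange(3), 10)

test = centers[1] + 0.3 * rng.standard_normal(2)   # unseen sample of class 1
predicted = knn_predict(test, refs, ref_labels)
```

In the reported system the same scheme runs over 81 classes with 10 reference images each, so each prediction is a vote among the 3 nearest of 810 reference feature vectors.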

