Voice conversion with parallel/non-parallel data and synthetic speech detection

Advances in anti-spoofing: from the perspective of ASVspoof challenges

APSIPA Transactions on Signal and Information Processing ◽

10.1017/atsip.2019.21 ◽

2020 ◽

Vol 9 ◽

Cited By ~ 5

Author(s):

Madhu R. Kamble ◽

Hardik B. Sailor ◽

Hemant A. Patil ◽

Haizhou Li

Keyword(s):

Deep Learning ◽

Speaker Verification ◽

Voice Conversion ◽

Synthetic Speech ◽

Acoustic Feature ◽

Speech Detection ◽

Feature Representations ◽

Biometric Systems ◽

Real World Applications ◽

Voice Biometrics

Abstract In recent years, automatic speaker verification (ASV) has been used extensively for voice biometrics, which has increased interest in securing these systems for real-world applications. ASV systems are vulnerable to various kinds of spoofing attacks, namely synthetic speech (SS), voice conversion (VC), replay, twins, and impersonation. This paper reviews the literature on ASV spoof detection, including novel acoustic feature representations, deep learning, and end-to-end systems. It also summarizes previous studies of spoofing attacks, with emphasis on SS, VC, and replay, along with recent efforts to develop countermeasures for the spoof speech detection (SSD) task. The limitations and challenges of the SSD task are also presented. Although several countermeasures have been reported in the literature, most are validated on a single database, and their performance is far from perfect. The security of voice biometric systems against spoofing attacks remains a challenging topic. This paper is based on a tutorial presented at the APSIPA Annual Summit and Conference 2017 and is intended as a quick start for those interested in the topic.


Learning Efficient Representations for Fake Speech Detection

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6044 ◽

2020 ◽

Vol 34 (04) ◽

pp. 5859-5866

Author(s):

Nishant Subramani ◽

Delip Rao

Keyword(s):

Transfer Learning ◽

Speech Synthesis ◽

Detection Performance ◽

Training Data ◽

Voice Conversion ◽

Synthetic Speech ◽

Model Parameters ◽

Speech Detection ◽

New Synthesis ◽

Alternative Approach

Synthetic speech, or “fake speech”, that matches personal vocal traits has become better and cheaper due to advances in deep learning-based speech synthesis and voice conversion approaches. The increased accessibility of synthetic speech systems and their growing misuse highlight the critical need to build countermeasures. Furthermore, new synthesis models evolve all the time, and the efficacy of previously trained detection models on these unseen attack vectors is poor. In this paper, we focus on two questions: 1) How can we build highly accurate, yet parameter- and sample-efficient models for fake speech detection? 2) How can we rapidly adapt detection models to new sources of fake speech? We present four parameter-efficient convolutional architectures for fake speech detection with best detection F1 scores of around 97 points on a large dataset of fake and bonafide speech. We show how the fake speech detection task naturally lends itself to a novel multi-task problem, further improving F1 scores for a mere 0.5% increase in model parameters. Our multi-task setting also helps in data-sparse situations, which are commonplace in adversarial settings. We investigate an alternative approach to the data-sparsity problem using transfer learning and show that it is possible to meet purely supervised detection performance for unseen attack vectors with as little as 6.25% of the training data. This is the first known application of transfer learning in adversarial settings for speech. Finally, we show how well our transfer learning approach adapts in an instance-efficient way to new attack vectors using the Real-Time Voice Cloning toolkit. We exceed the purely supervised detection performance (99.18 F1) with as little as 6.25% of the data.
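The abstract above reports detection quality as F1 scores in "points" (F1 multiplied by 100). As a reminder of how that metric is computed, here is a minimal sketch in plain Python. The labels and predictions are made-up toy data, and treating fake speech as the positive class (label 1) is an assumption for illustration, not something the abstract specifies.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): 1 = fake speech, 0 = bonafide speech.
labels      = [1, 1, 1, 0, 0, 0, 1, 0]
predictions = [1, 1, 0, 0, 0, 1, 1, 0]

score = f1_score(labels, predictions)
print(f"F1 = {100 * score:.1f} points")
```

In this toy example there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75 and the F1 score is 75 points.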


Average Modeling Approach to Voice Conversion with Non-Parallel Data

10.21437/odyssey.2018-32 ◽

2018 ◽

Cited By ~ 7

Author(s):

Xiaohai Tian ◽

Junchao Wang ◽

Haihua Xu ◽

Eng-Siong Chng ◽

Haizhou Li

Keyword(s):

Voice Conversion ◽

Modeling Approach ◽

Parallel Data


Voice Spoofing Countermeasure for Synthetic Speech Detection

2021 International Conference on Artificial Intelligence (ICAI) ◽

10.1109/icai52203.2021.9445238 ◽

2021 ◽

Author(s):

Farman Hassan ◽

Ali Javed

Keyword(s):

Synthetic Speech ◽

Speech Detection


Synthetic speech detection through short-term and long-term prediction traces

EURASIP Journal on Information Security ◽

10.1186/s13635-021-00116-3 ◽

2021 ◽

Vol 2021 (1) ◽

Author(s):

Clara Borrelli ◽

Paolo Bestagini ◽

Fabio Antonacci ◽

Augusto Sarti ◽

Stefano Tubaro

Keyword(s):

Deep Learning ◽

Speech Processing ◽

Synthetic Speech ◽

Opinion Formation ◽

Closed Set ◽

Speech Detection ◽

Open Set ◽

Technological Advances ◽

Speech Generation ◽

Long Term Prediction

Abstract Several methods for synthetic speech generation have been developed in the literature over the years. With the great technological advances brought by deep learning, many novel synthetic speech techniques achieving incredibly realistic results have recently been proposed. Because these methods generate convincing fake human voices, they can be used maliciously to harm today’s society (e.g., impersonating people, spreading fake news, manipulating opinion formation). For this reason, the ability to detect whether a speech recording is synthetic or pristine is becoming an urgent necessity. In this work, we develop a synthetic speech detector. It takes an audio recording as input, extracts a series of hand-crafted features motivated by the speech-processing literature, and classifies them in either a closed-set or an open-set scenario. The proposed detector is validated on a publicly available dataset consisting of 17 synthetic speech generation algorithms, ranging from old-fashioned vocoders to modern deep learning solutions. Results show that the proposed method outperforms recently proposed detectors in the forensics literature.
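The abstract does not spell out its hand-crafted features, but the title points to short-term and long-term prediction. The sketch below illustrates only the short-term part in its textbook form: linear prediction coefficients estimated with the autocorrelation method and the Levinson-Durbin recursion, followed by the prediction residual. It is not the authors' feature extractor; the function names, the toy signal, and the model order are all assumptions chosen for illustration.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation r[0..max_lag] of a finite sequence."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations.

    Returns (a, err) where a[0] = 1 and the residual is
    e[n] = sum_j a[j] * x[n - j]; err is the final prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this order
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def prediction_residual(x, a):
    """Filter x with the prediction-error filter A(z) defined by a."""
    return [
        sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
        for n in range(len(x))
    ]


# Toy signal (hypothetical): impulse response of 1 / (1 - 1.3 z^-1 + 0.6 z^-2).
signal = []
for n in range(300):
    s = 1.0 if n == 0 else 0.0
    if n >= 1:
        s += 1.3 * signal[n - 1]
    if n >= 2:
        s -= 0.6 * signal[n - 2]
    signal.append(s)

r = autocorrelation(signal, 2)
coeffs, err = levinson_durbin(r, 2)
residual = prediction_residual(signal, coeffs)
prediction_gain = r[0] / err  # signal energy relative to residual energy

print("LPC coefficients:", [round(c, 4) for c in coeffs])
print("prediction gain:", round(prediction_gain, 2))
```

Because the toy signal is exactly an order-2 all-pole impulse response, the recursion recovers the generating coefficients and the residual collapses to a single impulse. A detector in this spirit would compute statistics of such residuals and prediction gains per frame on real recordings, which is where synthetic and natural speech may differ.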


Non-parallel dictionary learning for voice conversion using non-negative Tucker decomposition

EURASIP Journal on Audio Speech and Music Processing ◽

10.1186/s13636-019-0160-1 ◽

2019 ◽

Vol 2019 (1) ◽

Author(s):

Yuki Takashima ◽

Toru Nakashika ◽

Tetsuya Takiguchi ◽

Yasuo Ariki

Keyword(s):

Dictionary Learning ◽

Computational Cost ◽

Tensor Decomposition ◽

Gaussian Mixture ◽

Voice Conversion ◽

Specific Information ◽

Learning Method ◽

Tucker Decomposition ◽

Parallel Data ◽

High Computational Cost

Abstract Voice conversion (VC) is a technique that converts only the speaker-specific information in the source speech while preserving the associated phonemic information. Non-negative matrix factorization (NMF)-based VC has been widely researched because it produces a more natural-sounding voice than conventional Gaussian mixture model-based VC. In conventional NMF-VC, models are trained using parallel data, so the speech data require elaborate pre-processing to generate parallel data. NMF-VC also tends to produce a large model, because the dictionary matrix holds many parallel exemplars, which leads to a high computational cost. In this study, an innovative non-parallel dictionary-learning method using non-negative Tucker decomposition (NTD) is proposed. The proposed method uses tensor decomposition to decompose an input observation into a set of mode matrices and one core tensor. The proposed NTD-based dictionary-learning method estimates the dictionary matrix for NMF-VC without using parallel data. The experimental results show that the proposed method outperforms other methods in both parallel and non-parallel settings.
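For readers unfamiliar with the two factorizations named in the abstract, their generic textbook forms are given below. The symbols, the dimensions F, T, K, and the choice of a third-order tensor are assumptions made for illustration; the abstract does not state the tensor order or the exact objective the authors use.

```latex
% NMF: a non-negative observation matrix V (e.g., a magnitude spectrogram)
% is approximated by a dictionary W and activations H.
V \approx W H, \qquad
V \in \mathbb{R}_{\ge 0}^{F \times T},\;
W \in \mathbb{R}_{\ge 0}^{F \times K},\;
H \in \mathbb{R}_{\ge 0}^{K \times T}

% Non-negative Tucker decomposition of a third-order tensor X:
% one core tensor G multiplied along each mode n by a mode matrix A^{(n)}.
\mathcal{X} \approx \mathcal{G} \times_1 A^{(1)} \times_2 A^{(2)} \times_3 A^{(3)}, \qquad
\mathcal{G} \ge 0,\; A^{(n)} \ge 0
```

Here \times_n denotes the mode-n product of a tensor with a matrix. In this generic picture, NMF is the two-factor matrix case, while the Tucker form adds a core tensor that couples the mode matrices.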
