Accented Speech Recognition Based on End-to-End Domain Adversarial Training of Neural Networks

2021 ◽  
Vol 11 (18) ◽  
pp. 8412
Author(s):  
Hyeong-Ju Na ◽  
Jeong-Sik Park

The performance of automatic speech recognition (ASR) may degrade when accented speech is recognized because such speech differs linguistically from standard speech. Conventional accented speech recognition studies have used the accent embedding method, in which accent embedding features are fed directly into the ASR network. Although this method improves the performance of accented speech recognition, it has some drawbacks, such as increased computational cost. This study proposes an efficient method of training the ASR model for accented speech in a domain adversarial way, based on the Domain Adversarial Neural Network (DANN). DANN serves as a domain adaptation technique for settings in which the training data and test data have different distributions. Our approach is thus expected to construct a reliable ASR model for accented speech by reducing the distribution differences between accented and standard speech. DANN consists of three sub-networks: a feature extractor, a domain classifier, and a label predictor. To adapt DANN to accented speech recognition, we constructed these three sub-networks independently, considering the characteristics of accented speech. In particular, we used an end-to-end framework based on Connectionist Temporal Classification (CTC) to develop the label predictor, a critical module that directly affects ASR results. To verify the efficiency of the proposed approach, we conducted several accented speech recognition experiments on four English accents: Australian, Canadian, British (England), and Indian. The experimental results showed that the proposed DANN-based model outperformed the baseline model for all accents, indicating that end-to-end domain adversarial training effectively reduced the distribution differences between accented speech and standard speech.
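The adversarial trick at the heart of DANN is a gradient reversal layer between the feature extractor and the domain classifier: it is the identity in the forward pass but flips the sign of the gradient in the backward pass, so the feature extractor learns features the domain classifier cannot separate. A minimal NumPy sketch of that layer (the scaling factor `lam` and the array shapes are illustrative, not taken from the paper):

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the
    backward pass, so the feature extractor is trained to *fool* the
    domain classifier -- the core mechanism of DANN."""

    def __init__(self, lam=1.0):
        self.lam = lam  # adversarial strength (hypothetical value)

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_from_domain_classifier):
        # Reverse the sign: the feature extractor ascends the domain loss,
        # pushing accented and standard features toward one distribution.
        return -self.lam * grad_from_domain_classifier

grl = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
assert np.allclose(grl.forward(x), x)          # forward is the identity
flipped = grl.backward(np.array([0.2, 0.4, -0.6]))
```

In a full framework this would be implemented as a custom autograd operation; the sketch only shows the sign-flip that makes the feature extractor and domain classifier adversaries.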

Author(s):  
Thibault Viglino ◽  
Petr Motlicek ◽  
Milos Cernak

2021 ◽  
pp. 271-278
Author(s):  
Juan Hussain ◽  
Christian Huber ◽  
Sebastian Stüker ◽  
Alexander Waibel

2021 ◽  
Author(s):  
Vaanathi Sundaresan ◽  
Giovanna Zamboni ◽  
Nicola K. Dinsdale ◽  
Peter M. Rothwell ◽  
Ludovica Griffanti ◽  
...  

Robust automated segmentation of white matter hyperintensities (WMHs) across different datasets (domains) is highly challenging due to differences in acquisition (scanner, sequence), differences in population (WMH amount and location), and the limited availability of manual segmentations for training supervised algorithms. In this work we explore various domain adaptation techniques, such as transfer learning and domain adversarial learning methods, including domain adversarial neural networks and domain unlearning, to improve the generalisability of our recently proposed triplanar ensemble network, which serves as our baseline model. We evaluated the domain adaptation techniques on source and target domains consisting of 5 different datasets with varying intensity profiles and lesion characteristics, acquired using different scanners. For transfer learning, we also studied various training options, such as the minimal number of unfrozen layers and the number of subjects required for finetuning in the target domain. On comparing the performance of the different techniques on the target dataset, unsupervised domain adversarial training of the neural network gave the best performance, making the technique promising for robust WMH segmentation.
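The transfer-learning option studied here, finetuning only a minimal number of unfrozen layers on the target domain, can be sketched as an SGD step that skips the frozen parameters. The layer list, learning rate, and shapes below are illustrative assumptions, not details from the paper:

```python
import numpy as np

def finetune_step(layers, grads, n_unfrozen, lr=0.01):
    """One SGD update that touches only the last n_unfrozen layers,
    mimicking finetuning with most source-domain weights frozen.

    layers, grads: lists of weight/gradient arrays, input-side first."""
    updated = []
    for i, (w, g) in enumerate(zip(layers, grads)):
        if i >= len(layers) - n_unfrozen:   # unfreeze only the output-side tail
            updated.append(w - lr * g)
        else:                               # frozen: keep source-domain weights
            updated.append(w)
    return updated

# Toy usage: two "layers", only the last one is finetuned.
layers = [np.array([1.0]), np.array([2.0])]
grads = [np.array([1.0]), np.array([1.0])]
new_layers = finetune_step(layers, grads, n_unfrozen=1, lr=0.1)
```

Freezing the input-side layers preserves the low-level features learned on the source domain while letting the task-specific head adapt to the target dataset with few subjects.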


Author(s):  
Kuang-Jui Hsu ◽  
Yen-Yu Lin ◽  
Yung-Yu Chuang

Object co-segmentation aims to segment the objects common to a set of images. This paper presents a CNN-based method that is unsupervised and end-to-end trainable to better solve this task. Our method is unsupervised in the sense that it does not require any training data in the form of object masks, but merely a set of images jointly covering objects of a specific class. Our method comprises two collaborative CNN modules: a feature extractor and a co-attention map generator. The former extracts the features of the estimated objects and backgrounds, and is trained with the proposed co-attention loss, which minimizes inter-image object discrepancy while maximizing intra-image figure-ground separation. The latter learns to generate co-attention maps with which the estimated figure-ground segmentation better fits the former module. In addition to the co-attention loss, a mask loss is developed to retain whole objects and remove noise. Experiments show that our method achieves superior results, even outperforming state-of-the-art supervised methods.
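The co-attention objective described above combines two opposing terms: pull the object features of different images together, and push each image's object features away from its own background features. A toy NumPy version on pooled per-image feature vectors (the pooling, distance measure, and shapes are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def co_attention_loss(obj_feats, bg_feats):
    """Toy co-attention objective.

    obj_feats, bg_feats: arrays of shape (num_images, feat_dim) holding
    pooled object and background features per image."""
    # Inter-image object discrepancy: spread of object features around
    # their mean across images (to be minimized).
    mean_obj = obj_feats.mean(axis=0)
    discrepancy = np.mean(np.sum((obj_feats - mean_obj) ** 2, axis=1))
    # Intra-image figure-ground separation: object-vs-background distance
    # within each image (to be maximized, hence the minus sign).
    separation = np.mean(np.sum((obj_feats - bg_feats) ** 2, axis=1))
    return discrepancy - separation

# Toy usage: identical object features across images, backgrounds far away,
# so discrepancy is 0 and separation dominates (loss is negative).
obj = np.array([[1.0, 0.0], [1.0, 0.0]])
bg = np.array([[0.0, 1.0], [0.0, 1.0]])
loss = co_attention_loss(obj, bg)
```

Minimizing this loss drives the feature extractor toward features that are consistent for the common object across images yet discriminative against each image's background.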

