Speech synthesis for a specific speaker based on a labeled speech database

Author(s):  
R. Hoory ◽  
D. Chazan


Author(s):  
Hiroyuki Segi

Many unit-selection speech-synthesis systems have been proposed. In most of them, the search units are short, such as syllables, phonemes, or diphones. However, when applied to large speech databases, shorter units produce more voice-waveform candidates, so a large database cannot be used practically without aggressive pruning, and such pruning impairs the quality of the synthesized speech. Here the author examined the possibility of using words as search units. Subjective evaluations indicated that 70% of the speech synthesized by the proposed method sounded more natural than that synthesized by a conventional method. The five-point mean opinion score of the synthesized speech was 3.5, and 21% of it was judged to sound as natural as human speech. These results demonstrate the effectiveness of unit-selection speech synthesis using words as search units.
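The search the abstract describes can be sketched as a Viterbi-style dynamic program over one recorded candidate per word, minimizing summed target and concatenation costs. This is a minimal illustration, not the paper's implementation; all function and cost names are assumptions.

```python
# Hypothetical sketch of unit selection with words as search units.
# `candidates(word)` lists recorded waveform candidates for a word,
# `target_cost` scores a candidate against the target context, and
# `concat_cost` scores the join between two adjacent candidates.
def select_units(words, candidates, target_cost, concat_cost):
    # trellis[i]: candidate -> (cumulative cost, best predecessor)
    trellis = [{c: (target_cost(words[0], c), None)
                for c in candidates(words[0])}]
    for word in words[1:]:
        prev = trellis[-1]
        stage = {}
        for c in candidates(word):
            # cheapest way to reach candidate c from the previous stage
            pred, cost = min(
                ((p, pcost + concat_cost(p, c))
                 for p, (pcost, _) in prev.items()),
                key=lambda t: t[1],
            )
            stage[c] = (cost + target_cost(word, c), pred)
        trellis.append(stage)
    # backtrack the cheapest path through the trellis
    path = [min(trellis[-1], key=lambda c: trellis[-1][c][0])]
    for stage in reversed(trellis[1:]):
        path.append(stage[path[-1]][1])
    path.reverse()
    return path
```

With word-sized units the per-word candidate lists stay small enough that this exact search remains tractable, which is the practical advantage the abstract argues for.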


Author(s):  
Lluís Formiga ◽  
Francesc Alías

Unit Selection Text-to-Speech Synthesis (US-TTS) systems produce synthetic speech by retrieving previously recorded speech units from a speech database (corpus), driven by a weighted cost function (Black & Campbell, 1995). To obtain high-quality synthetic speech, these weights must be optimized efficiently. To that end, previous work introduced a weight-tuning technique based on evolutionary perceptual tests by means of Active Interactive Genetic Algorithms (aiGAs) (Alías, Llorà, Formiga, Sastry & Goldberg, 2006). aiGAs mine models that map users' subjective preferences by means of partial-ordering graphs, synthetic fitness, and Evolutionary Computation (EC) (Llorà, Sastry, Goldberg, Gupta & Lakshmi, 2005). Although aiGAs offer an effective method for mapping a single user's preferences, to the best of our knowledge the methodology for extracting solutions shared among different individual preferences (hereafter denoted as common knowledge) has not yet been tackled. Furthermore, an ambiguity problem must be solved when different users evolve toward different weight configurations. In this review, Generative Topographic Mapping (GTM) is introduced as a method to extract common knowledge from the aiGA models obtained from user preferences.
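The weighted cost function at the heart of this tuning problem can be illustrated with a tiny sketch. The feature names and weight values below are assumptions for illustration, not those of the cited systems.

```python
# Illustrative weighted unit-selection cost: per-feature subcosts
# (e.g. pitch, duration, energy mismatches) are combined by weights.
# Finding good `weights` is the optimization that aiGA-based
# perceptual tests address.
def weighted_cost(weights, subcosts):
    """Combine per-feature subcosts into a single selection cost."""
    return sum(w * c for w, c in zip(weights, subcosts))

# Two weight configurations score the same candidate very differently,
# which is why perceptually grounded weight tuning matters.
subcosts = [0.8, 0.1]  # hypothetical (pitch mismatch, duration mismatch)
cost_a = weighted_cost([0.2, 0.8], subcosts)
cost_b = weighted_cost([0.8, 0.2], subcosts)
```

Because candidate rankings flip as the weights change, different users' evolved weight configurations can disagree, which is the ambiguity problem GTM is brought in to resolve.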


2012 ◽  
Vol E95.D (9) ◽  
pp. 2351-2354 ◽  
Author(s):  
Doo Hwa HONG ◽  
June Sig SUNG ◽  
Kyung Hwan OH ◽  
Nam Soo KIM

2020 ◽  
Vol 9 (1) ◽  
pp. 1764-1769

A Text-to-Speech (TTS) system is a speech-synthesis application that converts text into speech. The current project focuses on developing a TTS system for the Tamil language using Unit Selection Synthesis. Letter-level segmentation of the input text reduces the corpus size compared with syllable-level segmentation. The segmented units are retrieved by their Unicode values and concatenated to produce the synthesized speech. The intelligibility and naturalness of the spoken output can be improved using smoothing techniques. An Optimal Coupling smoothing technique is implemented for smooth transitions between the concatenated speech segments, creating continuous speech output that resembles a human voice. A fraction-based waveform-concatenation method is used to produce intelligible speech segments from the pre-recorded speech database.
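The optimal-coupling idea can be sketched as follows: slide the join point within a small search window, pick the offset where the two segments' overlapping samples match best, then cross-fade. This is a simplified illustration under assumed parameters (sample lists, squared-error matching), not the project's actual implementation.

```python
# Hypothetical sketch of optimal-coupling smoothing at a concatenation
# point. `left` and `right` are waveform sample lists; the join offset
# is chosen to minimize squared error over the overlap region.
def optimal_coupling_join(left, right, overlap=4, search=3):
    best_off, best_err = 0, float("inf")
    for off in range(-search, search + 1):
        cut = len(left) - overlap + off
        if cut < 0 or cut + overlap > len(left):
            continue  # candidate window falls outside the segment
        err = sum((a - b) ** 2
                  for a, b in zip(left[cut:cut + overlap], right[:overlap]))
        if err < best_err:
            best_off, best_err = off, err
    cut = len(left) - overlap + best_off
    # cross-fade the overlapping region for a smooth transition
    n = overlap
    faded = [left[cut + i] * (1 - i / (n - 1)) + right[i] * (i / (n - 1))
             for i in range(n)]
    return left[:cut] + faded + right[overlap:]
```

When the segment boundaries already agree, the cross-fade leaves the samples essentially unchanged; when they disagree, it spreads the discontinuity over the overlap instead of producing an audible click.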


Author(s):  
_ Aripin ◽  
Hanny Haryanto

This study aims to build realistic visual speech synthesis for Indonesian so that it can be used to learn Indonesian pronunciation. We used a combination of the viseme-morphing and syllable-concatenation methods. The viseme-morphing method deforms one viseme into another so that the animation of the mouth shape looks smoother; it is used to create transitions between visemes. The syllable-concatenation method assembles visemes according to particular syllable patterns. We built a syllable-based voice database as the basis for synchronization between syllables, speech, and the viseme models. The method proposed in this study consists of several stages, namely the formation of the Indonesian viseme models, the design of the facial animation character, the development of the speech database, a synchronization process, and subjective testing of the resulting application. Subjective tests were conducted with 30 respondents, who assessed the suitability and naturalness of the mouth movements when uttering Indonesian texts. The MOS (Mean Opinion Score) method was used to calculate the average of the respondents' scores. The MOS results for the synchronization and naturalness criteria are 4.283 and 4.107 on a scale of 1 to 5, showing that the synthesized visual speech achieves a high level of synchronization and naturalness. Therefore, the system can display visualizations of phoneme pronunciation to support learning Indonesian pronunciation.
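The two numeric steps in the pipeline above can be sketched briefly: viseme morphing as linear interpolation between mouth-shape keyframes, and the MOS as an average of respondents' ratings. Representing a viseme as a flat list of vertex coordinates is an assumption for illustration, not the study's actual model.

```python
# Minimal sketch of viseme morphing: blend two mouth-shape keyframes
# (vertex coordinate lists) with interpolation parameter t in [0, 1].
def morph(viseme_a, viseme_b, t):
    """t=0 reproduces viseme A; t=1 reproduces viseme B."""
    return [(1 - t) * a + t * b for a, b in zip(viseme_a, viseme_b)]

# Mean Opinion Score: the average of subjective ratings on a 1-5 scale.
def mean_opinion_score(scores):
    return sum(scores) / len(scores)
```

Stepping `t` from 0 to 1 over several animation frames yields the smooth mouth-shape transition the abstract describes.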

