Speech synthesis for a specific speaker based on a labeled speech database

Author(s):  
R. Hoory ◽  
D. Chazan


Author(s):  
Hiroyuki Segi

Many unit-selection speech-synthesis systems have been proposed. In most of them, the search units are short, such as syllables, phonemes, or diphones. However, when applied to large speech databases, shorter units produce more voice-waveform candidates, so a large database cannot be used practically without aggressive pruning, and such pruning impairs the quality of the synthesized speech. Here the author examined the possibility of using words as search units. Subjective evaluations indicated that 70% of the speech synthesized by the proposed method sounded more natural than that synthesized by a conventional method. The five-point mean opinion score of the synthesized speech was 3.5, and 21% of it was judged to sound as natural as human speech. These results demonstrate the effectiveness of unit-selection speech synthesis using words as search units.
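The search the abstract describes can be sketched as a Viterbi-style dynamic program over one recorded candidate per word, minimizing summed target and concatenation costs. This is a minimal illustration, not the paper's implementation; all function and cost names are assumptions.

```python
# Hypothetical sketch of unit selection with words as search units.
# `candidates(word)` lists recorded waveform candidates for a word,
# `target_cost` scores a candidate against the target context, and
# `concat_cost` scores the join between two adjacent candidates.
def select_units(words, candidates, target_cost, concat_cost):
    # trellis[i]: candidate -> (cumulative cost, best predecessor)
    trellis = [{c: (target_cost(words[0], c), None)
                for c in candidates(words[0])}]
    for word in words[1:]:
        prev = trellis[-1]
        stage = {}
        for c in candidates(word):
            # cheapest way to reach candidate c from the previous stage
            pred, cost = min(
                ((p, pcost + concat_cost(p, c))
                 for p, (pcost, _) in prev.items()),
                key=lambda t: t[1],
            )
            stage[c] = (cost + target_cost(word, c), pred)
        trellis.append(stage)
    # backtrack the cheapest path through the trellis
    path = [min(trellis[-1], key=lambda c: trellis[-1][c][0])]
    for stage in reversed(trellis[1:]):
        path.append(stage[path[-1]][1])
    path.reverse()
    return path
```

With word-sized units the per-word candidate lists stay small enough that this exact search remains tractable, which is the practical advantage the abstract argues for.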


Author(s):  
Lluís Formiga ◽  
Francesc Alías

Unit Selection Text-to-Speech Synthesis (US-TTS) systems produce synthetic speech by retrieving previously recorded speech units from a speech database (corpus), driven by a weighted cost function (Black & Campbell, 1995). To obtain high-quality synthetic speech, these weights must be optimized efficiently. To that end, previous work introduced a weight-tuning technique based on evolutionary perceptual tests by means of Active Interactive Genetic Algorithms (aiGAs) (Alías, Llorà, Formiga, Sastry & Goldberg, 2006). aiGAs mine models that map users' subjective preferences by means of partial-ordering graphs, synthetic fitness, and Evolutionary Computation (EC) (Llorà, Sastry, Goldberg, Gupta & Lakshmi, 2005). Although aiGAs offer an effective method for mapping a single user's preferences, to the best of our knowledge the methodology for extracting solutions shared among different individual preferences (hereafter denoted as common knowledge) has not yet been tackled. Furthermore, an ambiguity problem must be solved when different users evolve toward different weight configurations. In this review, Generative Topographic Mapping (GTM) is introduced as a method to extract common knowledge from the aiGA models obtained from user preferences.
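The weighted cost function at the heart of this tuning problem can be illustrated with a tiny sketch. The feature names and weight values below are assumptions for illustration, not those of the cited systems.

```python
# Illustrative weighted unit-selection cost: per-feature subcosts
# (e.g. pitch, duration, energy mismatches) are combined by weights.
# Finding good `weights` is the optimization that aiGA-based
# perceptual tests address.
def weighted_cost(weights, subcosts):
    """Combine per-feature subcosts into a single selection cost."""
    return sum(w * c for w, c in zip(weights, subcosts))

# Two weight configurations score the same candidate very differently,
# which is why perceptually grounded weight tuning matters.
subcosts = [0.8, 0.1]  # hypothetical (pitch mismatch, duration mismatch)
cost_a = weighted_cost([0.2, 0.8], subcosts)
cost_b = weighted_cost([0.8, 0.2], subcosts)
```

Because candidate rankings flip as the weights change, different users' evolved weight configurations can disagree, which is the ambiguity problem GTM is brought in to resolve.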


2012 ◽  
Vol E95.D (9) ◽  
pp. 2351-2354 ◽  
Author(s):  
Doo Hwa HONG ◽  
June Sig SUNG ◽  
Kyung Hwan OH ◽  
Nam Soo KIM

2020 ◽  
Vol 9 (1) ◽  
pp. 1764-1769

A Text-to-Speech (TTS) system is a speech-synthesis application that converts text into speech. The current project focuses on developing a TTS system for the Tamil language using Unit Selection Synthesis. Letter-level segmentation of the input text reduces the corpus size compared with syllable-level segmentation. The segmented units are retrieved by their Unicode values and concatenated to produce the synthesized speech. The intelligibility and naturalness of the spoken output can be improved using smoothing techniques. An Optimal Coupling smoothing technique is implemented for smooth transitions between the concatenated speech segments, creating continuous speech output that resembles a human voice. A fraction-based waveform-concatenation method is used to produce intelligible speech segments from the pre-recorded speech database.
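The optimal-coupling idea can be sketched as follows: slide the join point within a small search window, pick the offset where the two segments' overlapping samples match best, then cross-fade. This is a simplified illustration under assumed parameters (sample lists, squared-error matching), not the project's actual implementation.

```python
# Hypothetical sketch of optimal-coupling smoothing at a concatenation
# point. `left` and `right` are waveform sample lists; the join offset
# is chosen to minimize squared error over the overlap region.
def optimal_coupling_join(left, right, overlap=4, search=3):
    best_off, best_err = 0, float("inf")
    for off in range(-search, search + 1):
        cut = len(left) - overlap + off
        if cut < 0 or cut + overlap > len(left):
            continue  # candidate window falls outside the segment
        err = sum((a - b) ** 2
                  for a, b in zip(left[cut:cut + overlap], right[:overlap]))
        if err < best_err:
            best_off, best_err = off, err
    cut = len(left) - overlap + best_off
    # cross-fade the overlapping region for a smooth transition
    n = overlap
    faded = [left[cut + i] * (1 - i / (n - 1)) + right[i] * (i / (n - 1))
             for i in range(n)]
    return left[:cut] + faded + right[overlap:]
```

When the segment boundaries already agree, the cross-fade leaves the samples essentially unchanged; when they disagree, it spreads the discontinuity over the overlap instead of producing an audible click.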


Author(s):  
_ Aripin ◽  
Hanny Haryanto

This study aims to build realistic visual speech synthesis for Indonesian so that it can be used to learn Indonesian pronunciation. We used a combination of the viseme-morphing and syllable-concatenation methods. The viseme-morphing method deforms one viseme into another so that the animation of the mouth shape looks smoother; it is used to create transitions between visemes. The syllable-concatenation method assembles visemes according to particular syllable patterns. We built a syllable-based voice database as the basis for synchronization between syllables, speech, and the viseme models. The method proposed in this study consists of several stages, namely the formation of the Indonesian viseme models, the design of the facial animation character, the development of the speech database, a synchronization process, and subjective testing of the resulting application. Subjective tests were conducted with 30 respondents, who assessed the suitability and naturalness of the mouth movements when uttering Indonesian texts. The MOS (Mean Opinion Score) method was used to calculate the average of the respondents' scores. The MOS results for the synchronization and naturalness criteria are 4.283 and 4.107 on a scale of 1 to 5, showing that the synthesized visual speech achieves a high level of synchronization and naturalness. Therefore, the system can display visualizations of phoneme pronunciation to support learning Indonesian pronunciation.
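The two numeric steps in the pipeline above can be sketched briefly: viseme morphing as linear interpolation between mouth-shape keyframes, and the MOS as an average of respondents' ratings. Representing a viseme as a flat list of vertex coordinates is an assumption for illustration, not the study's actual model.

```python
# Minimal sketch of viseme morphing: blend two mouth-shape keyframes
# (vertex coordinate lists) with interpolation parameter t in [0, 1].
def morph(viseme_a, viseme_b, t):
    """t=0 reproduces viseme A; t=1 reproduces viseme B."""
    return [(1 - t) * a + t * b for a, b in zip(viseme_a, viseme_b)]

# Mean Opinion Score: the average of subjective ratings on a 1-5 scale.
def mean_opinion_score(scores):
    return sum(scores) / len(scores)
```

Stepping `t` from 0 to 1 over several animation frames yields the smooth mouth-shape transition the abstract describes.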

