Voice Quality Modelling for Expressive Speech Synthesis

2014 ◽  
Vol 2014 ◽  
pp. 1-12 ◽  
Author(s):  
Carlos Monzo ◽  
Ignasi Iriondo ◽  
Joan Claudi Socoró

This paper presents the perceptual experiments carried out to validate a methodology for transforming expressive speech styles using voice quality (VoQ) parameter modelling, along with the well-known prosodic parameters (F0, duration, and energy), from a neutral style into a number of expressive ones. The main goal was to validate the usefulness of VoQ in enhancing expressive synthetic speech in terms of speech quality and style identification. A harmonic plus noise model (HNM) was used to modify the VoQ and prosodic parameters extracted from an expressive speech corpus. Perception test results indicated that the obtained expressive speech styles were improved when VoQ modelling was used along with the prosodic characteristics.

2011 ◽  
Vol 109 (3) ◽  
pp. 105-108 ◽  
Author(s):  
A. Kajackas ◽  
A. Anskaitis ◽  
D. Gursnys

In this paper, a method for evaluating time-varying conversational speech quality in wireless communications is proposed. The proposed algorithm evaluates quality degradation using indicators based on the count of lost frames and on voice activity indications. The correctness of the proposed algorithm is investigated by comparing its test results with results obtained using the PESQ algorithm under the same conditions. The achieved average correlation coefficient is 0.975, independent of the frame loss model and of the percentage of silence in the test sentences. The proposed algorithm can be implemented in mobile stations and used for speech quality evaluation during real conversations. Ill. 3, bibl. 13, tabl. 3 (in English; abstracts in English and Lithuanian). http://dx.doi.org/10.5755/j01.eee.109.3.182
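The kind of indicator the abstract describes, built from counts of lost frames combined with voice activity information, can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the specific indicator shown (the fraction of voice-active frames that were lost) and all names are assumptions for the sake of the example.

```python
import numpy as np

def degradation_indicator(lost, active):
    """Fraction of voice-active frames that were lost in transmission.

    lost   -- boolean sequence, True where a frame was lost
    active -- boolean sequence, True where voice activity was detected
    """
    lost = np.asarray(lost, dtype=bool)
    active = np.asarray(active, dtype=bool)
    n_active = active.sum()
    if n_active == 0:
        return 0.0                       # no speech: losses are inaudible
    return float((lost & active).sum() / n_active)

# Ten frames: three losses, two of which fall inside active speech.
lost   = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
active = [0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(degradation_indicator(lost, active))  # 0.4
```

Weighting losses by voice activity is what makes such an indicator insensitive to the percentage of silence in the test material, consistent with the result reported above.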


Author(s):  
Keikichi Hirose

After starting as an effort to mimic the human process of speech sound generation, synthetic speech has reached a quality that makes it difficult to notice that it is synthetic. This owes much to the development of waveform concatenation methods, which select the most appropriate speech segments from a huge speech corpus. Although a lack of flexibility in producing various speech qualities/styles has been pointed out, this problem is about to be solved by introducing statistical frameworks into parametric speech synthesis. Now, a speaker can even speak a foreign language in his/her own voice using advanced voice-conversion techniques. However, when the prosodic features of speech are considered, current technologies are not well suited to handling their hierarchical structure over a long time span, and the introduction of prosody modelling into the speech-synthesis process is necessary. In this chapter, after reviewing the history of voice/speech synthesis, technologies are explained, starting from text-to-speech and concept-to-speech conversion. Then, methods of sound generation are introduced. Statistical parametric speech synthesis, especially HMM-based speech synthesis, is introduced as a technology that enables flexible speech synthesis, that is, synthetic speech with various qualities/styles, while requiring a smaller speech corpus. After that, the problem of frame-by-frame processing of prosodic features is addressed and the importance of prosody modelling is pointed out. Prosodic (fundamental frequency) modelling is surveyed and, finally, the generation process model is introduced, with some experimental results when applied to HMM-based speech synthesis.
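The generation process model mentioned at the end of the abstract is commonly identified with the Fujisaki model, in which the log-F0 contour is the superposition of a baseline, phrase components (impulse responses of a second-order system) and accent components (ceiling-limited step responses). A minimal sketch of that superposition follows; the time constants and command values are illustrative defaults, not taken from the chapter.

```python
import numpy as np

def fujisaki_f0(t, fb, phrases, accents, alpha=3.0, beta=20.0, gamma=0.9):
    """Fujisaki generation-process model of the F0 contour.

    ln F0(t) = ln(fb) + sum of phrase components + sum of accent components.
    phrases -- list of (Ap, T0): magnitude and onset time of each phrase command
    accents -- list of (Aa, T1, T2): amplitude, onset and offset of each accent command
    """
    def g_phrase(x):                    # phrase control mechanism
        x = np.clip(x, 0.0, None)       # response is zero before the command
        return alpha ** 2 * x * np.exp(-alpha * x)

    def g_accent(x):                    # accent control mechanism
        x = np.clip(x, 0.0, None)
        return np.minimum(1.0 - (1.0 + beta * x) * np.exp(-beta * x), gamma)

    ln_f0 = np.full(np.shape(t), np.log(fb))
    for ap, t0 in phrases:
        ln_f0 += ap * g_phrase(t - t0)
    for aa, t1, t2 in accents:
        ln_f0 += aa * (g_accent(t - t1) - g_accent(t - t2))
    return np.exp(ln_f0)

# One phrase command at t = 0 s and one accent command on [0.3, 0.8] s
# over a 100 Hz baseline (all values purely illustrative).
t = np.linspace(0.0, 2.0, 200)
f0 = fujisaki_f0(t, fb=100.0, phrases=[(0.5, 0.0)], accents=[(0.3, 0.3, 0.8)])
```

Because the phrase and accent components operate on different time scales, such a model captures exactly the long-span hierarchical structure that frame-by-frame statistical processing struggles with.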


Author(s):  
Xixin Wu ◽  
Yuewen Cao ◽  
Mu Wang ◽  
Songxiang Liu ◽  
Shiyin Kang ◽  
...  

2008 ◽  
Author(s):  
Grazyna Demenko ◽  
J. Bachan ◽  
Bernd Möbius ◽  
K. Klessa ◽  
M. Szymański ◽  
...  

2021 ◽  
pp. 2150022
Author(s):  
Caio Cesar Enside de Abreu ◽  
Marco Aparecido Queiroz Duarte ◽  
Bruno Rodrigues de Oliveira ◽  
Jozue Vieira Filho ◽  
Francisco Villarreal

Speech processing systems are very important in applications involving speech and voice quality, such as automatic speech recognition, forensic phonetics and speech enhancement, among others. In most of them, acoustic environmental noise is added to the original signal, decreasing the signal-to-noise ratio (SNR) and, as a consequence, the speech quality. Therefore, estimating noise is one of the most important steps in speech processing, whether to reduce it before processing or to design robust algorithms. In this paper, a new approach to estimating noise from speech signals is presented and its effectiveness is tested in the speech enhancement context. For this purpose, partial least squares (PLS) regression is used to model the acoustic environment (AE), and a Wiener filter based on a priori SNR estimation is implemented to evaluate the proposed approach. Six noise types are used to create seven acoustically modelled noises. The basic idea is to use the AE model to identify the noise type and estimate its power, which is then used in a speech processing system. Speech signals processed with the proposed method and with classical noise estimators are evaluated through objective measures. Results show that the proposed method yields better speech quality than state-of-the-art noise estimators, enabling its use in real-time applications in the fields of robotics, telecommunications and acoustic analysis.
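The enhancement stage described above, a Wiener filter driven by an a priori SNR estimate, can be sketched with the standard decision-directed formulation. This is a generic textbook form, not the paper's implementation; all variable names are illustrative, and the noise power would in the paper come from the PLS-based AE model rather than being given directly.

```python
import numpy as np

def wiener_gain(noisy_power, noise_power, prev_clean_power, a=0.98):
    """Per-bin Wiener gain H(k) = xi / (1 + xi), with the a priori SNR xi
    from the decision-directed rule.

    noisy_power      -- |Y(k)|^2 of the current noisy frame
    noise_power      -- estimated noise power (here assumed given)
    prev_clean_power -- |S(k)|^2 of the previous enhanced frame
    a                -- decision-directed smoothing factor
    """
    gamma = noisy_power / noise_power                    # a posteriori SNR
    xi = (a * prev_clean_power / noise_power
          + (1.0 - a) * np.maximum(gamma - 1.0, 0.0))    # a priori SNR
    return xi / (1.0 + xi)                               # Wiener gain

# Example: one frequency bin, 6 dB a posteriori SNR, first frame
# (no previous enhanced frame yet, so prev_clean_power = 0).
g = wiener_gain(4.0, 1.0, 0.0)
```

In use, the gain is applied per frame and per frequency bin to the noisy short-time spectrum, and the resulting enhanced power feeds `prev_clean_power` in the next frame.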


2002 ◽  
Vol 45 (4) ◽  
pp. 689-699 ◽  
Author(s):  
Donald G. Jamieson ◽  
Vijay Parsa ◽  
Moneca C. Price ◽  
James Till

We investigated how standard speech coders, currently used in modern communication systems, affect the quality of the speech of persons who have common speech and voice disorders. Three standardized speech coders (GSM 6.10 RPE-LTP, FS1016 CELP, and FS1015 LPC) and two speech coders based on subband processing were evaluated for their performance. Coder effects were assessed by measuring the quality of speech samples both before and after processing by the speech coders. Speech quality was rated by 10 listeners with normal hearing on 28 different scales representing pitch and loudness changes, speech rate, laryngeal and resonatory dysfunction, and coder-induced distortions. Results showed that (a) nine scale items were consistently and reliably rated by the listeners; (b) all coders degraded speech quality on these nine scales, with the GSM and CELP coders providing the better-quality speech; and (c) interactions between coders and individual voices did occur on several voice quality scales.

