Voice source and vocal tract variations as cues to emotional states perceived from expressive conversational speech

Author(s):  
Hiroki Mori ◽  
Hideki Kasuya
Author(s):  
Filipa M. B. Lã ◽  
Brian P. Gill

Singing performance is highly competitive; thus, finding strategies to accelerate the acquisition of knowledge that results in an efficient and effective vocal technique is of the utmost importance. There are many ways in which a singer may acquire such a technique, and they can be grounded in the physiological processes of voice production. This chapter explores these processes within the context of singing performance. The authors examine three major aspects of singing: 1) efficient control of breathing, such that optimal airflow and subglottal pressure are available as needed for a given frequency and intensity; 2) maximized laryngeal coordination, so that the voice source signal contains all the necessary frequency components for the desired tone; and 3) the modulation of the source signal by subtle shaping of the vocal tract. The advantages and disadvantages of various pedagogical methods are discussed, including breath management, known as appoggio, and different resonant strategies. The authors advocate for a scientifically grounded teaching method that allows for physiological differences between individuals, genders, and voice classifications.


Author(s):  
Masoud Geravanchizadeh ◽  
Elnaz Forouhandeh ◽  
Meysam Bashirpour

The performance of speech recognition systems trained with neutral utterances degrades significantly when these systems are tested with emotional speech. Since anyone can speak emotionally in real-world environments, automatic speech recognition systems need to take the emotional state of speech into account. Little work has been done on emotion-affected speech recognition; so far, most research has focused on the classification of speech emotions. In this paper, the vocal tract length normalization method is employed to enhance the robustness of the emotion-affected speech recognition system. For this purpose, two structures of the speech recognition system are used, based on hybrids of the hidden Markov model with the Gaussian mixture model and with the deep neural network. Frequency warping is applied in the filterbank and/or discrete cosine transform domain(s) of the feature extraction process. The warping is conducted so as to normalize the emotional feature components and bring them close to their corresponding neutral feature components. The performance of the proposed system is evaluated in neutrally trained/emotionally tested conditions for different speech features and emotional states (i.e., Anger, Disgust, Fear, Happy, and Sad). The constructed emotion-affected speech recognition system is based on the Kaldi automatic speech recognition toolkit, with the Persian emotional speech database and the crowd-sourced emotional multi-modal actors dataset as the input corpora. The experiments show that, in general, the warped emotional features yield better performance of the emotion-affected speech recognition system than their unwarped counterparts.
Also, the deep neural network-hidden Markov model hybrid outperforms the hybrid with the Gaussian mixture model.
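
As a rough illustration of frequency warping applied to a filterbank, the sketch below uses a piecewise-linear warping function, which is one common choice for vocal tract length normalization. The abstract does not state which warping function the authors used, so the function shape, the names `vtln_warp` and `warped_mel_centres`, the 8 kHz upper band edge, and the 0.85 knee position are all assumptions made for illustration.

```python
import math


def hz_to_mel(freq_hz):
    """Convert Hz to mel using the common 2595*log10(1 + f/700) formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def vtln_warp(freq_hz, alpha, f_max_hz=8000.0, knee_fraction=0.85):
    """Piecewise-linear warp (assumed form, not taken from the paper).

    Below the knee the frequency is scaled by alpha; above it, a second
    linear segment maps the upper band edge f_max_hz onto itself.
    """
    if not 0.0 <= freq_hz <= f_max_hz:
        raise ValueError("freq_hz must lie in [0, f_max_hz]")
    knee = knee_fraction * f_max_hz
    if alpha * knee >= f_max_hz:
        raise ValueError("alpha too large for this knee position")
    if freq_hz <= knee:
        return alpha * freq_hz
    # Straight line from (knee, alpha * knee) to (f_max_hz, f_max_hz).
    slope = (f_max_hz - alpha * knee) / (f_max_hz - knee)
    return alpha * knee + slope * (freq_hz - knee)


def warped_mel_centres(n_filters, alpha, f_max_hz=8000.0):
    """Mel-spaced filter centre frequencies (Hz) after warping."""
    mel_max = hz_to_mel(f_max_hz)
    centres = [
        mel_to_hz(mel_max * (i + 1) / (n_filters + 1))
        for i in range(n_filters)
    ]
    return [vtln_warp(c, alpha, f_max_hz) for c in centres]


if __name__ == "__main__":
    # alpha = 1.0 leaves the filterbank untouched; other values shift it.
    for alpha in (0.9, 1.0, 1.1):
        centres = warped_mel_centres(5, alpha)
        print(alpha, [round(c, 1) for c in centres])
```

In a setup like the one described, a warping factor would presumably be chosen per utterance or per emotional state so that the warped features move toward their neutral counterparts; how that factor is selected is not covered by this sketch.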


Author(s):  
Christian T. Herbst ◽  
David M. Howard ◽  
Jan G. Švec

The voice instrument is composed of three basic sub-systems: the pulmonary apparatus, the laryngeal voice source, and the vocal tract for sound modification. In this chapter, laryngeal sound generation is examined in closer detail, with a special focus on singing voice production. In particular, the relation between the quality of vocal fold vibration, the consistency of the glottal airflow, and the spectral composition of the resulting laryngeal sound output (before being filtered by the vocal tract) is discussed. Two basic physiological parameters for controlling these features are described: cartilaginous adduction (controlled along the dimension of “breathy” vs. “pressed” voice); and membranous medialization (influenced by the choice of singing voice register). It is shown that these two physiological parameters can be varied independently, and how they can be incorporated into a pedagogical model. Based on this model, a typical application from the singing studio is described. Finally, the range of sound qualities resulting from independent variation of cartilaginous adduction and membranous medialization is commented on by five well-known voice pedagogues, in an attempt to unify the respective terminology in voice pedagogy.


1992 ◽  
Vol 92 (4) ◽  
pp. 2301-2301
Author(s):  
J. H. Eggen ◽  
S. G. Nooteboom ◽  
A. J. M. Houtsma

Author(s):  
Johan Sundberg

The sound quality of singing is determined by three basic factors: the air pressure under the vocal folds (the subglottal pressure), the mechanical properties of the vocal folds, and the resonance properties of the vocal tract. Subglottal pressure is controlled by the respiratory apparatus. It regulates vocal loudness and is varied with pitch in singing. Together with the mechanical properties of the folds, which are controlled by laryngeal muscles, it has a decisive influence on vocal fold vibrations, which convert the tracheal airstream to a pulsating airflow, the voice source. The voice source determines pitch, vibrato, and register, and also the overall slope of the spectrum. The sound of the voice source is filtered by the resonances of the vocal tract, or the formants, of which the two lowest determine the vowel quality and the higher ones the personal voice quality. Timing is crucial for creating emotional expressivity; it uses an acoustic code that shows striking similarities to that used in speech. The perceived loudness of a vowel sound seems more closely related to the subglottal pressure with which it was produced than to the acoustical sound level. Some investigations of acoustical correlates of tone placement and variation of larynx height are described, as are properties that affect the perceived naturalness of synthesized singing. Finally, subglottal pressure, voice source, and formant-frequency characteristics of some non-classical styles of singing are discussed.
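
The source-filter relationship described above can be sketched numerically: each harmonic of the voice source receives a level set by the source spectrum slope, to which the gain of the vocal tract resonances is added. This is a minimal sketch under assumed values, not a model taken from the chapter. The two-pole digital resonator, the 16 kHz sample rate, the -12 dB/octave source slope, and the function names `resonator_gain_db` and `harmonic_levels_db` are all illustrative assumptions.

```python
import cmath
import math


def resonator_gain_db(freq_hz, formant_hz, bandwidth_hz, sample_rate_hz=16000.0):
    """Gain (dB) of a two-pole digital resonator at freq_hz.

    The resonator stands in for a single formant; its pole radius is set
    by the bandwidth and its pole angle by the formant frequency.
    """
    r = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)
    theta = 2.0 * math.pi * formant_hz / sample_rate_hz
    z_inv = cmath.exp(-2j * math.pi * freq_hz / sample_rate_hz)
    denom = 1.0 - 2.0 * r * math.cos(theta) * z_inv + (r * r) * z_inv * z_inv
    return -20.0 * math.log10(abs(denom))


def harmonic_levels_db(f0_hz, formants, n_harmonics=20,
                       source_slope_db_per_octave=-12.0):
    """Return (frequency, level in dB) for each harmonic of f0_hz.

    formants is a list of (centre_hz, bandwidth_hz) pairs. The source
    level falls off with the assumed slope; the tract gain is the sum of
    the resonator gains in dB, i.e. a cascade of resonators.
    """
    levels = []
    for k in range(1, n_harmonics + 1):
        freq = k * f0_hz
        source_db = source_slope_db_per_octave * math.log2(k)
        tract_db = sum(resonator_gain_db(freq, fc, bw) for fc, bw in formants)
        levels.append((freq, source_db + tract_db))
    return levels


if __name__ == "__main__":
    # Two lowest formants only, with illustrative (assumed) values.
    formants = [(500.0, 80.0), (1500.0, 100.0)]
    for freq, level in harmonic_levels_db(100.0, formants, n_harmonics=16):
        print(f"{freq:7.1f} Hz  {level:7.2f} dB")
```

Changing the two formant centre frequencies while keeping the source fixed alters which harmonics are emphasized, which corresponds to the statement that the two lowest formants determine vowel quality.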

