The Intelligibility of Synthesized Speech

1987 ◽  
Vol 30 (3) ◽  
pp. 425-431 ◽  
Author(s):  
Julia Hoover ◽  
Joe Reichle ◽  
Dianne Van Tasell ◽  
David Cole

The intelligibility of two speech synthesizers [ECHO II (Street Electronics, 1982) and VOTRAX (VOTRAX Division, 1981)] was compared to the intelligibility of natural speech in each of three different contextual conditions: (a) single words, (b) "low-probability sentences" in which the last word could not be predicted from preceding context, and (c) "high-probability sentences" in which the last word could be predicted from preceding context. Additionally, the effect of practice on performance in each condition was examined. Natural speech was more intelligible than either type of synthesized speech regardless of word/sentence condition. In both sentence conditions, VOTRAX speech was significantly more intelligible than ECHO II speech. No practice effect was observed for VOTRAX, whereas an ascending linear trend occurred for ECHO II. Implications for the use of inexpensive speech synthesis units as components of augmentative communication aids for persons with severe speech and/or language impairments are discussed.
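The word-level intelligibility measure implicit in studies like this one can be sketched as a simple scoring function. This is an illustrative sketch only, not the study's actual scoring protocol: the function name and the position-independent matching are assumptions (transcription scoring in such studies is usually stricter, matching word-by-word in order).

```python
def percent_words_correct(target: str, response: str) -> float:
    """Percentage of target words that appear anywhere in the
    listener's repeated response (position-independent sketch)."""
    target_words = target.lower().split()
    response_words = set(response.lower().split())
    hits = sum(word in response_words for word in target_words)
    return 100.0 * hits / len(target_words)

# e.g. a listener mishearing the final (unpredictable) word:
score = percent_words_correct("stir your coffee with a spoon",
                              "stir your coffee with a spool")
# 5 of 6 target words correct
```

Averaging such scores per condition (single words vs. low- and high-probability sentences) is what allows the contextual comparisons the abstract reports.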

2001 ◽  
Vol 44 (5) ◽  
pp. 1052-1057 ◽  
Author(s):  
Kathryn D. R. Drager ◽  
Joe E. Reichle

The use of speech synthesis in electronic communication aids allows individuals who use augmentative and alternative communication (AAC) devices to communicate with a variety of partners. However, communication will only be effective if the speech signal is readily understood by the listener. The intelligibility of synthesized speech is influenced by a variety of factors, including the provision of context. Although the facilitative effects of context have been demonstrated extensively in studies with young adults, there are few investigations into older adults' ability to decode the synthesized speech signal. The present study investigated whether discourse context affected the intelligibility of synthesized sentences for young adult and older adult listeners. Listeners were asked to repeat 15-word sentences that were either presented in isolation or preceded by a story that set the context for the sentence. Participants correctly repeated significantly more words in the sentences when they were preceded by related sentences than when the sentences were presented in isolation. This research shows a facilitating effect of context in discourse, wherein previous words and sentences are related to later sentences, for both younger and older adult listeners. These results have direct implications for AAC system message transmission.


Author(s):  
Hiroyuki Segi

Most unit-selection speech-synthesis systems use rather short search units, such as syllables, phonemes, and diphones. When applied to large speech databases, however, shorter units produce many more voice-waveform candidates, so a large database cannot be used in practice without tight pruning, and tight pruning impairs the quality of the synthesized speech. Here the author examined the possibility of using words as search units. Subjective evaluations indicated that 70% of the speech synthesized by the proposed method sounded more natural than that synthesized by a conventional method. The five-point mean opinion score of the synthesized speech was 3.5, and 21% was judged to sound as natural as human speech. These results demonstrate the effectiveness of unit-selection speech synthesis using words as search units.
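The search the abstract describes, choosing one recorded unit per word so that the summed target cost (fit of unit to word) and concatenation cost (smoothness of each join) is minimal, can be sketched as a Viterbi-style dynamic program. This is an illustrative toy, not Segi's system: the cost functions, unit identifiers, and data are invented.

```python
def select_units(target_words, candidates, target_cost, join_cost):
    """Pick one candidate unit per word, minimizing total target
    cost plus join cost. Returns (unit_sequence, total_cost)."""
    # layers[i] maps candidate -> (best cost up to word i, backpointer)
    first = target_words[0]
    layers = [{c: (target_cost(first, c), None) for c in candidates[first]}]
    for word in target_words[1:]:
        prev, cur = layers[-1], {}
        for c in candidates[word]:
            # cheapest predecessor for this candidate
            p, cost = min(((p, pc + join_cost(p, c))
                           for p, (pc, _) in prev.items()),
                          key=lambda pair: pair[1])
            cur[c] = (cost + target_cost(word, c), p)
        layers.append(cur)
    # backtrack from the cheapest final candidate
    best = min(layers[-1], key=lambda c: layers[-1][c][0])
    path = [best]
    for i in range(len(layers) - 1, 0, -1):
        path.append(layers[i][path[-1]][1])
    return path[::-1], layers[-1][best][0]

# Toy database: two recordings of "good", one of "morning".
words = ["good", "morning"]
cands = {"good": ["good_01", "good_02"], "morning": ["morning_01"]}
tcost = lambda w, u: 0.0 if u.endswith("_01") else 1.0
jcost = lambda a, b: 0.5  # flat concatenation cost for the toy
units, total = select_units(words, cands, tcost, jcost)
# units == ["good_01", "morning_01"], total == 0.5
```

With word-sized units there are far fewer candidates per position than with phonemes or diphones, which is why the search stays tractable on a large database without aggressive pruning.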


Author(s):  
Marvin Coto-Jiménez ◽  
John Goddard-Close

Recent developments in speech synthesis have produced systems capable of producing speech which closely resembles natural speech, and researchers now strive to create models that more accurately mimic human voices. One such development is the incorporation of multiple linguistic styles in various languages and accents. Speech synthesis based on Hidden Markov Models (HMM) is of great interest to researchers, due to its ability to produce sophisticated features with a small footprint. Despite some progress, its quality has not yet reached the level of the currently predominant unit-selection approaches, which select and concatenate recordings of real speech, and work has been conducted to try to improve HMM-based systems. In this paper, we present an application of long short-term memory (LSTM) deep neural networks as a postfiltering step in HMM-based speech synthesis. Our motivation stems from a similar desire to obtain characteristics which are closer to those of natural speech. The paper analyzes four types of postfilters obtained using five voices, ranging from a single postfilter that enhances all the parameters to a multi-stream proposal that separately enhances groups of parameters. The different proposals are evaluated using three objective measures and are statistically compared to determine whether the differences between them are significant. The results described in the paper indicate that HMM-based voices can be enhanced using this approach, especially for the multi-stream postfilters on the considered objective measures.
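The postfiltering idea can be sketched as an LSTM run over the frames of HMM-synthesized acoustic parameters, mapping each frame to an "enhanced" frame. The sketch below is illustrative only, not the authors' networks: it uses a single hand-rolled LSTM cell in NumPy, and the weights are random stand-ins for trained parameters.

```python
import numpy as np

def lstm_postfilter(frames, Wx, Wh, b, Wout, bout):
    """Run one LSTM cell over frames of synthesized acoustic
    parameters and project each hidden state to an enhanced frame.
    frames: (T, d_in); returns (T, d_out)."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    h_dim = Wh.shape[0]
    h = np.zeros(h_dim)
    c = np.zeros(h_dim)
    enhanced = []
    for x in frames:
        z = x @ Wx + h @ Wh + b              # all four gates stacked
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)           # cell-state update
        h = o * np.tanh(c)                   # new hidden state
        enhanced.append(h @ Wout + bout)     # per-frame output layer
    return np.stack(enhanced)

# Random stand-ins for trained weights (d_in = d_out = 3, 4 hidden units):
rng = np.random.default_rng(0)
d_in, h_dim, d_out = 3, 4, 3
Wx = rng.normal(size=(d_in, 4 * h_dim))
Wh = rng.normal(size=(h_dim, 4 * h_dim))
b = np.zeros(4 * h_dim)
Wout = rng.normal(size=(h_dim, d_out))
bout = np.zeros(d_out)
frames = rng.normal(size=(5, d_in))          # 5 frames of parameters
out = lstm_postfilter(frames, Wx, Wh, b, Wout, bout)   # shape (5, 3)
```

In the multi-stream variant the abstract mentions, a separate postfilter of this kind would be trained for and applied to each group of parameters (e.g., the spectral stream and the F0 stream) rather than one network enhancing all parameters at once.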


2020 ◽  
Vol 62 (2) ◽  
pp. 7-17
Author(s):  
Karolina Jankowska ◽  
Tomasz Kuczmarski ◽  
Grażyna Demenko

The matter of shadowing natural speech has been discussed in many studies and papers; however, very little is known about human phonetic convergence to synthesized speech. To find out more about this issue, an experiment in the Polish language was conducted. Two types of stimuli were used: natural speech and synthesised speech. Five sets of sentences with various phonetic phenomena in Polish were prepared, and a group of twenty participants was recorded, giving a total of 100 samples for each phenomenon. The results show convergence to both natural and synthesised speech in sets 1, 2, and 4, while in sets 3 and 5 no convergence was observed. Baseline production showed that the great majority of participants preferred the ɛn/ɛm variant of the phonetic feature, reflected in 83 out of 100 sentences. When shadowing natural speech, participants changed ɛn/ɛm to ɛw/ɛ̃ in 26 cases and ɛw/ɛ̃ to ɛn/ɛm in 4; when shadowing synthesised speech, the shift from ɛn/ɛm to ɛw/ɛ̃ occurred in 18 sentences and from ɛw/ɛ̃ to ɛn/ɛm in 4. Intonation convergence was also observed in the perceptual analysis; however, the analysis of F0 statistics did not show statistically significant differences.


1992 ◽  
Vol 86 (10) ◽  
pp. 426-428 ◽  
Author(s):  
E. Hjelmquist ◽  
U. Dahlstrand ◽  
L. Hedelin

Three groups of visually impaired persons (two middle-aged and one old) were investigated with respect to memory and understanding of texts presented with speech synthesis and natural speech, respectively. The results showed that speech synthesis generally yielded poorer performance than natural speech. Experience had no effect on performance, and there were only marginal effects related to age. However, there were large differences among the groups with respect to the presentation speed chosen in the speech-synthesis condition.


2002 ◽  
Vol 45 (4) ◽  
pp. 802-810 ◽  
Author(s):  
Mary E. Reynolds ◽  
Charlene Isaacs-Duvall ◽  
Michelle Lynn Haddox

This study examined the effect of listening practice on the ability of young adults to comprehend natural speech and DECtalk synthesized speech by having them perform a sentence verification task over a 5-day period. Results showed that participants' response latencies to sentences presented in both types of speech shortened in a similar fashion across the 5-day period, although latencies remained significantly longer in response to DECtalk than to natural speech throughout. These results suggest that high-quality synthesized speech, such as DECtalk, can be useful in many human factors applications.


1988 ◽  
Vol 19 (4) ◽  
pp. 401-409 ◽  
Author(s):  
Holly J. Massey

The Token Test for Children was given in a synthesized-speech version and a natural-speech version to 11 language-impaired children aged 8 years, 9 months to 10 years, 1 month and to 11 control subjects matched for age and sex. The scores of the language-impaired children on the synthesized version were significantly lower than (a) the synthesized-speech scores of the control group and (b) their own scores on the natural-speech version. Task complexity was a significant factor for the experimental group. Language-impaired children may have difficulty understanding some synthesized voice commands.

