The Intelligibility of Synthesized Speech

1987 ◽  
Vol 30 (3) ◽  
pp. 425-431 ◽  
Author(s):  
Julia Hoover ◽  
Joe Reichle ◽  
Dianne Van Tasell ◽  
David Cole

The intelligibility of two speech synthesizers [ECHO II (Street Electronics, 1982) and VOTRAX (VOTRAX Division, 1981)] was compared to the intelligibility of natural speech in each of three different contextual conditions: (a) single words, (b) "low-probability sentences" in which the last word could not be predicted from preceding context, and (c) "high-probability sentences" in which the last word could be predicted from preceding context. Additionally, the effect of practice on performance in each condition was examined. Natural speech was more intelligible than either type of synthesized speech regardless of word/sentence condition. In both sentence conditions, VOTRAX speech was significantly more intelligible than ECHO II speech. No practice effect was observed for VOTRAX, whereas an ascending linear trend occurred for ECHO II. Implications for the use of inexpensive speech synthesis units as components of augmentative communication aids for persons with severe speech and/or language impairments are discussed.
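The word-level intelligibility measure implicit in studies like this one can be sketched as a simple scoring function. This is an illustrative sketch only, not the study's actual scoring protocol: the function name and the position-independent matching are assumptions (transcription scoring in such studies is usually stricter, matching word-by-word in order).

```python
def percent_words_correct(target: str, response: str) -> float:
    """Percentage of target words that appear anywhere in the
    listener's repeated response (position-independent sketch)."""
    target_words = target.lower().split()
    response_words = set(response.lower().split())
    hits = sum(word in response_words for word in target_words)
    return 100.0 * hits / len(target_words)

# e.g. a listener mishearing the final (unpredictable) word:
score = percent_words_correct("stir your coffee with a spoon",
                              "stir your coffee with a spool")
# 5 of 6 target words correct
```

Averaging such scores per condition (single words vs. low- and high-probability sentences) is what allows the contextual comparisons the abstract reports.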

2001 ◽  
Vol 44 (5) ◽  
pp. 1052-1057 ◽  
Author(s):  
Kathryn D. R. Drager ◽  
Joe E. Reichle

The use of speech synthesis in electronic communication aids allows individuals who use augmentative and alternative communication (AAC) devices to communicate with a variety of partners. However, communication will only be effective if the speech signal is readily understood by the listener. The intelligibility of synthesized speech is influenced by a variety of factors, including the provision of context. Although the facilitative effects of context have been demonstrated extensively in studies with young adults, there are few investigations into older adults' ability to decode the synthesized speech signal. The present study investigated whether discourse context affected the intelligibility of synthesized sentences for young adult and older adult listeners. Listeners were asked to repeat 15-word sentences that were either presented in isolation or preceded by a story that set the context for the sentence. Participants correctly repeated significantly more words in the sentences when they were preceded by related sentences than when the sentences were presented in isolation. This research shows a facilitating effect of context in discourse, wherein previous words and sentences are related to later sentences, for both younger and older adult listeners. These results have direct implications for AAC system message transmission.


Author(s):  
Hiroyuki Segi

Most unit-selection speech-synthesis systems use rather short search units, such as syllables, phonemes, and diphones. When applied to large speech databases, however, shorter units produce many more voice-waveform candidates, so a large database cannot be used in practice without tight pruning, and tight pruning impairs the quality of the synthesized speech. Here the author examined the possibility of using words as search units. Subjective evaluations indicated that 70% of the speech synthesized by the proposed method sounded more natural than that synthesized by a conventional method. The five-point mean opinion score of the synthesized speech was 3.5, and 21% was judged to sound as natural as human speech. These results demonstrate the effectiveness of unit-selection speech synthesis using words as search units.
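The search the abstract describes, choosing one recorded unit per word so that the summed target cost (fit of unit to word) and concatenation cost (smoothness of each join) is minimal, can be sketched as a Viterbi-style dynamic program. This is an illustrative toy, not Segi's system: the cost functions, unit identifiers, and data are invented.

```python
def select_units(target_words, candidates, target_cost, join_cost):
    """Pick one candidate unit per word, minimizing total target
    cost plus join cost. Returns (unit_sequence, total_cost)."""
    # layers[i] maps candidate -> (best cost up to word i, backpointer)
    first = target_words[0]
    layers = [{c: (target_cost(first, c), None) for c in candidates[first]}]
    for word in target_words[1:]:
        prev, cur = layers[-1], {}
        for c in candidates[word]:
            # cheapest predecessor for this candidate
            p, cost = min(((p, pc + join_cost(p, c))
                           for p, (pc, _) in prev.items()),
                          key=lambda pair: pair[1])
            cur[c] = (cost + target_cost(word, c), p)
        layers.append(cur)
    # backtrack from the cheapest final candidate
    best = min(layers[-1], key=lambda c: layers[-1][c][0])
    path = [best]
    for i in range(len(layers) - 1, 0, -1):
        path.append(layers[i][path[-1]][1])
    return path[::-1], layers[-1][best][0]

# Toy database: two recordings of "good", one of "morning".
words = ["good", "morning"]
cands = {"good": ["good_01", "good_02"], "morning": ["morning_01"]}
tcost = lambda w, u: 0.0 if u.endswith("_01") else 1.0
jcost = lambda a, b: 0.5  # flat concatenation cost for the toy
units, total = select_units(words, cands, tcost, jcost)
# units == ["good_01", "morning_01"], total == 0.5
```

With word-sized units there are far fewer candidates per position than with phonemes or diphones, which is why the search stays tractable on a large database without aggressive pruning.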


Author(s):  
Marvin Coto-Jiménez ◽  
John Goddard-Close

Recent developments in speech synthesis have produced systems capable of producing speech which closely resembles natural speech, and researchers now strive to create models that more accurately mimic human voices. One such development is the incorporation of multiple linguistic styles in various languages and accents. Speech synthesis based on Hidden Markov Models (HMM) is of great interest to researchers, due to its ability to produce sophisticated features with a small footprint. Despite some progress, its quality has not yet reached the level of the currently predominant unit-selection approaches, which select and concatenate recordings of real speech, and work has been conducted to try to improve HMM-based systems. In this paper, we present an application of long short-term memory (LSTM) deep neural networks as a postfiltering step in HMM-based speech synthesis. Our motivation stems from a similar desire to obtain characteristics which are closer to those of natural speech. The paper analyzes four types of postfilters obtained using five voices, ranging from a single postfilter that enhances all the parameters to a multi-stream proposal that separately enhances groups of parameters. The different proposals are evaluated using three objective measures and are statistically compared to determine whether the differences between them are significant. The results described in the paper indicate that HMM-based voices can be enhanced using this approach, especially for the multi-stream postfilters on the considered objective measures.
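The postfiltering idea can be sketched as an LSTM run over the frames of HMM-synthesized acoustic parameters, mapping each frame to an "enhanced" frame. The sketch below is illustrative only, not the authors' networks: it uses a single hand-rolled LSTM cell in NumPy, and the weights are random stand-ins for trained parameters.

```python
import numpy as np

def lstm_postfilter(frames, Wx, Wh, b, Wout, bout):
    """Run one LSTM cell over frames of synthesized acoustic
    parameters and project each hidden state to an enhanced frame.
    frames: (T, d_in); returns (T, d_out)."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    h_dim = Wh.shape[0]
    h = np.zeros(h_dim)
    c = np.zeros(h_dim)
    enhanced = []
    for x in frames:
        z = x @ Wx + h @ Wh + b              # all four gates stacked
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)           # cell-state update
        h = o * np.tanh(c)                   # new hidden state
        enhanced.append(h @ Wout + bout)     # per-frame output layer
    return np.stack(enhanced)

# Random stand-ins for trained weights (d_in = d_out = 3, 4 hidden units):
rng = np.random.default_rng(0)
d_in, h_dim, d_out = 3, 4, 3
Wx = rng.normal(size=(d_in, 4 * h_dim))
Wh = rng.normal(size=(h_dim, 4 * h_dim))
b = np.zeros(4 * h_dim)
Wout = rng.normal(size=(h_dim, d_out))
bout = np.zeros(d_out)
frames = rng.normal(size=(5, d_in))          # 5 frames of parameters
out = lstm_postfilter(frames, Wx, Wh, b, Wout, bout)   # shape (5, 3)
```

In the multi-stream variant the abstract mentions, a separate postfilter of this kind would be trained for and applied to each group of parameters (e.g., the spectral stream and the F0 stream) rather than one network enhancing all parameters at once.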


2020 ◽  
Vol 62 (2) ◽  
pp. 7-17
Author(s):  
Karolina Jankowska ◽  
Tomasz Kuczmarski ◽  
Grażyna Demenko

The matter of shadowing natural speech has been discussed in many studies and papers; however, very little is known about human phonetic convergence to synthesized speech. To find out more about this issue, an experiment in the Polish language was conducted. Two types of stimuli were used: natural speech and synthesised speech. Five sets of sentences with various phonetic phenomena in Polish were prepared, and a group of twenty participants was recorded, giving a total of 100 samples for each phenomenon. The results show convergence to both natural and synthesised speech in sets 1, 2, and 4, while in sets 3 and 5 no convergence was observed. Baseline production showed that the great majority of participants preferred the ɛn/ɛm variant of the phonetic feature, reflected in 83 out of 100 sentences. When shadowing natural speech, participants changed ɛn/ɛm to ɛw/ɛ̃ in 26 cases and ɛw/ɛ̃ to ɛn/ɛm in 4; when shadowing synthesised speech, the shift from ɛn/ɛm to ɛw/ɛ̃ occurred in 18 sentences and from ɛw/ɛ̃ to ɛn/ɛm in 4. Intonation convergence was also observed in the perceptual analysis; however, the analysis of F0 statistics did not show statistically significant differences.


1992 ◽  
Vol 86 (10) ◽  
pp. 426-428 ◽  
Author(s):  
E. Hjelmquist ◽  
U. Dahlstrand ◽  
L. Hedelin

Three groups of visually impaired persons (two middle-aged and one old) were investigated with respect to memory and understanding of texts presented with speech synthesis and natural speech, respectively. The results showed that speech synthesis generally yielded poorer performance than natural speech. Experience had no effect on performance, and there were only marginal effects related to age. However, there were large differences among the groups with respect to the presentation speed chosen in the speech-synthesis condition.


2002 ◽  
Vol 45 (4) ◽  
pp. 802-810 ◽  
Author(s):  
Mary E. Reynolds ◽  
Charlene Isaacs-Duvall ◽  
Michelle Lynn Haddox

This study examined the effect of listening practice on the ability of young adults to comprehend natural speech and DECtalk synthesized speech by having them perform a sentence verification task over a 5-day period. Results showed that participants' response latencies to sentences presented in both types of speech shortened in a similar fashion across the 5-day period, although latencies remained significantly longer in response to DECtalk than to natural speech throughout. These results suggest that high-quality synthesized speech, such as DECtalk, can be useful in many human factors applications.


1988 ◽  
Vol 19 (4) ◽  
pp. 401-409 ◽  
Author(s):  
Holly J. Massey

The Token Test for Children was given in a synthesized-speech version and a natural-speech version to 11 language-impaired children aged 8 years, 9 months to 10 years, 1 month and to 11 control subjects matched for age and sex. The scores of the language-impaired children on the synthesized version were significantly lower than (a) the synthesized-speech scores of the control group and (b) their own scores on the natural-speech version. Task complexity was a significant factor for the experimental group. Language-impaired children may have difficulty understanding some synthesized voice commands.

