Linguistic features weighting for a text-to-speech system without prosody model

Towards a multilingual prosody model for text-to-speech

IEEE International Conference on Acoustics Speech and Signal Processing ◽

10.1109/icassp.2002.5743744 ◽

2002 ◽

Author(s):

Oliver Jokisch ◽

Hongwei Ding ◽

Hans Kruschke

Keyword(s):

Text To Speech ◽

Prosody Model

Download Full-text

Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis

Computer Speech & Language ◽

10.1016/j.csl.2020.101183 ◽

2021 ◽

Vol 67 ◽

pp. 101183

Author(s):

Yusuke Yasuda ◽

Xin Wang ◽

Junichi Yamagishi

Keyword(s):

Speech Synthesis ◽

Text To Speech ◽

Linguistic Features ◽

Learning Abilities ◽

Text To Speech Synthesis

Download Full-text

Towards a multilingual prosody model for text-to-speech

IEEE International Conference on Acoustics Speech and Signal Processing ◽

10.1109/icassp.2002.1005766 ◽

2002 ◽

Cited By ~ 1

Author(s):

Jokisch ◽

Ding ◽

Kruschke

Keyword(s):

Text To Speech ◽

Prosody Model

Download Full-text

RobuTrans: A Robust Transformer-Based Text-to-Speech Model

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6337 ◽

2020 ◽

Vol 34 (05) ◽

pp. 8228-8235

Author(s):

Naihan Li ◽

Yanqing Liu ◽

Yu Wu ◽

Shujie Liu ◽

Sheng Zhao ◽

...

Keyword(s):

Neural Network ◽

Speech Synthesis ◽

Neural Model ◽

Attention Mechanism ◽

Maximum Length ◽

Prosodic Features ◽

Text To Speech ◽

Linguistic Features ◽

Excellent Quality ◽

Speech Model

Recently, neural network based speech synthesis has achieved outstanding results, by which the synthesized audios are of excellent quality and naturalness. However, current neural TTS models suffer from the robustness issue, which results in abnormal audios (bad cases) especially for unusual text (unseen context). To build a neural model which can synthesize both natural and stable audios, in this paper, we make a deep analysis of why the previous neural TTS models are not robust, based on which we propose RobuTrans (Robust Transformer), a robust neural TTS model based on Transformer. Comparing to TransformerTTS, our model first converts input texts to linguistic features, including phonemic features and prosodic features, then feed them to the encoder. In the decoder, the encoder-decoder attention is replaced with a duration-based hard attention mechanism, and the causal self-attention is replaced with a "pseudo non-causal attention" mechanism to model the holistic information of the input. Besides, the position embedding is replaced with a 1-D CNN, since it constrains the maximum length of synthesized audio. With these modifications, our model not only fix the robustness problem, but also achieves on parity MOS (4.36) with TransformerTTS (4.37) and Tacotron2 (4.37) on our general set.

Download Full-text

PROSODY PREDICTION FOR TAMIL TEXT-TO-SPEECH SYNTHESIZER USING SENTIMENT ANALYSIS

Asian Journal of Pharmaceutical and Clinical Research ◽

10.22159/ajpcr.2017.v10s1.19535 ◽

2017 ◽

Vol 10 (13) ◽

pp. 6

Author(s):

Vaibhavi Rajendran ◽

G Bharadwaja Kumar

Keyword(s):

Prediction Model ◽

Sentiment Analysis ◽

Classification Accuracy ◽

Prediction Models ◽

Text To Speech ◽

Speech Synthesizer ◽

Human Voice ◽

Prosody Model ◽

The One ◽

Prosody Prediction

A speech synthesizer which sounds similar to a human voice is preferred over a robotic voice, and hence to increase the naturalness of a speech synthesizer an efficacious prosody model is imperative. Hence, this paper is focused on developing a prosody prediction model using sentiment analysis for a Tamil speech synthesizer. Two variations of prosody prediction models using SentiWordNet are experimented: one without a stemmer and the other with a stemmer. The prosody prediction model with a stemmer performs much more efficiently than the one without a stemmer as it tackles the highly agglutinative and inflectional words in Tamil language in a better way and is exemplified clearly, in this paper. The performance of the prosody prediction model with a stemmer has a higher classification accuracy of 77% on the test set in comparison to the 57% accuracy by the prosody model without a stemmer.

Download Full-text

Novel eigenpitch-based prosody model for text-to-speech synthesis

10.21437/interspeech.2007-229 ◽

2007 ◽

Author(s):

Jilei Tian ◽

Jani Nurminen ◽

Imre Kiss

Keyword(s):

Speech Synthesis ◽

Text To Speech ◽

Text To Speech Synthesis ◽

Prosody Model

Download Full-text

Prosody model in a Mandarin text-to-speech system based on a hierarchical approach

2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532) ◽

10.1109/icme.2000.869636 ◽

2002 ◽

Cited By ~ 3

Author(s):

Neng-Huang Pan ◽

Wen-Tsai Jen ◽

Shyr-Shen Yu ◽

Ming-Shing Yu ◽

Shyh-Yang Huang ◽

...

Keyword(s):

Hierarchical Approach ◽

Text To Speech ◽

Prosody Model

Download Full-text

A Review of 21 iPad Applications for Augmentative and Alternative Communication Purposes

Perspectives on Augmentative and Alternative Communication ◽

10.1044/aac21.2.60 ◽

2012 ◽

Vol 21 (2) ◽

pp. 60-71 ◽

Cited By ~ 24

Author(s):

Ashley Alliano ◽

Kimberly Herriger ◽

Anthony D. Koutsoftas ◽

Theresa E. Bartolotta

Keyword(s):

Augmentative And Alternative Communication ◽

Cost Effective ◽

Alternative Form ◽

Alternative Communication ◽

Text To Speech ◽

Reference Guide ◽

Expressive Communication ◽

Communication Needs ◽

User Friendly ◽

Ipad Applications

Abstract Using the iPad tablet for Augmentative and Alternative Communication (AAC) purposes can facilitate many communicative needs, is cost-effective, and is socially acceptable. Many individuals with communication difficulties can use iPad applications (apps) to augment communication, provide an alternative form of communication, or target receptive and expressive language goals. In this paper, we will review a collection of iPad apps that can be used to address a variety of receptive and expressive communication needs. Based on recommendations from Gosnell, Costello, and Shane (2011), we describe the features of 21 apps that can serve as a reference guide for speech-language pathologists. We systematically identified 21 apps that use symbols only, symbols and text-to-speech, and text-to-speech only. We provide descriptions of the purpose of each app, along with the following feature descriptions: speech settings, representation, display, feedback features, rate enhancement, access, motor competencies, and cost. In this review, we describe these apps and how individuals with complex communication needs can use them for a variety of communication purposes and to target a variety of treatment goals. We present information in a user-friendly table format that clinicians can use as a reference guide.

Download Full-text

Using computerized text analysis to examine associations between linguistic features and clients’ distress during psychotherapy.

Journal of Counseling Psychology ◽

10.1037/cou0000440 ◽

2020 ◽

Cited By ~ 1

Author(s):

Natalie Shapira ◽

Gal Lazarus ◽

Yoav Goldberg ◽

Eva Gilboa-Schechtman ◽

Rivka Tuval-Mashiach ◽

...

Keyword(s):

Text Analysis ◽

Linguistic Features ◽

Computerized Text Analysis

Download Full-text

Design of English text-to-speech conversion algorithm based on machine learning

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189238 ◽

2020 ◽

pp. 1-12

Author(s):

Li Dongmei

Keyword(s):

Machine Learning ◽

Speech Synthesis ◽

Feature Recognition ◽

Learning Algorithm ◽

Morphological Structure ◽

English Text ◽

Text To Speech ◽

Part Of Speech ◽

Modern Computer ◽

Conversion Algorithm

English text-to-speech conversion is the key content of modern computer technology research. Its difficulty is that there are large errors in the conversion process of text-to-speech feature recognition, and it is difficult to apply the English text-to-speech conversion algorithm to the system. In order to improve the efficiency of the English text-to-speech conversion, based on the machine learning algorithm, after the original voice waveform is labeled with the pitch, this article modifies the rhythm through PSOLA, and uses the C4.5 algorithm to train a decision tree for judging pronunciation of polyphones. In order to evaluate the performance of pronunciation discrimination method based on part-of-speech rules and HMM-based prosody hierarchy prediction in speech synthesis systems, this study constructed a system model. In addition, the waveform stitching method and PSOLA are used to synthesize the sound. For words whose main stress cannot be discriminated by morphological structure, label learning can be done by machine learning methods. Finally, this study evaluates and analyzes the performance of the algorithm through control experiments. The results show that the algorithm proposed in this paper has good performance and has a certain practical effect.

Download Full-text