COMPARISON OF EXPRESSIVE SPEECH SYNTHESIS SYSTEMS WITH THE POSSIBILITY OF EMOTION-STRENGTH ADJUSTMENT

Authors

  • Mia Vujović

DOI:

https://doi.org/10.24867/11BE18Vujovic

Keywords:

expressive speech synthesis, emotion modeling, embedding vectors, deep neural networks

Abstract

In expressive speech synthesis, it is important to generate emotional speech that reflects the complexity of emotional states. Many TTS systems model emotions as discrete codes, but modeling the variation within emotional states is crucial for generating human-like speech. This paper presents a theoretical analysis and comparison of two innovative expressive TTS systems that model the complexity of emotion as a continuous vector which can be manipulated. The results show that the approach based on continuous t-SNE embedding vectors is applicable only to specific databases, while the other approach, based on interpolation of points in the embedding space of a multi-speaker, multi-style model, is more general but requires additional analysis.
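The emotion-strength adjustment described above can be illustrated with a minimal sketch: linear interpolation between a neutral embedding and a full-strength emotional embedding. The vectors and the function name below are hypothetical placeholders; in the systems compared in the paper, such embeddings would be learned by the TTS model itself.

```python
import numpy as np

# Hypothetical 4-dimensional style embeddings, for illustration only.
# In practice these are learned by a multi-speaker, multi-style TTS model.
neutral = np.array([0.0, 0.0, 0.0, 0.0])    # neutral speaking style
angry = np.array([1.0, -0.5, 0.25, 2.0])    # full-strength emotional style

def adjust_emotion_strength(neutral_emb, emotion_emb, strength):
    """Interpolate between neutral and emotional embeddings.

    strength = 0.0 reproduces the neutral style, 1.0 the full emotion;
    intermediate values yield intermediate emotion strengths.
    """
    return (1.0 - strength) * neutral_emb + strength * emotion_emb

# A half-strength version of the emotion, fed to the synthesizer
# in place of the original style embedding.
half_strength = adjust_emotion_strength(neutral, angry, 0.5)
```

The same idea extends beyond the neutral/emotional pair: any two points in the embedding space can be interpolated, which is what makes the multi-speaker, multi-style approach more general than one tied to a specific database.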

Published

2020-12-26

Section

Electrotechnical and Computer Engineering