COMPARISON OF EXPRESSIVE SPEECH SYNTHESIS SYSTEMS WITH THE POSSIBILITY OF EMOTION-STRENGTH ADJUSTMENT
DOI: https://doi.org/10.24867/11BE18Vujovic

Keywords: expressive speech synthesis, emotion modeling, embedding vectors, deep neural networks

Abstract
In expressive speech synthesis, it is important to generate emotional speech that reflects the complexity of emotional states. Many TTS systems model emotions as discrete codes, but modeling variation within emotional states is crucial for generating human-like speech. This paper presents a theoretical analysis and comparison of two innovative expressive TTS systems that model the complexity of emotion as a continuous vector which can be manipulated. The results show that the approach based on continuous t-SNE embedding vectors is applicable only to specific databases, whereas the other approach, based on interpolation of points in the embedding space of a multi-speaker, multi-style model, is more general but requires additional analysis.
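As a rough illustration of the two manipulation strategies compared here, the sketch below is hypothetical code, not taken from either system; all names, shapes, and data are assumptions. It projects a set of utterance-level style embeddings to a continuous 2-D space with t-SNE, and it linearly interpolates between a neutral and an emotional point in the model's own embedding space, with the interpolation weight acting as an emotion-strength control.

# A minimal sketch (not either paper's implementation) of the two
# manipulation strategies: t-SNE projection of style embeddings, and
# interpolation between points in a multi-speaker/multi-style embedding
# space. All shapes, names, and data here are illustrative assumptions.

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Assumed toy data: 200 utterance-level style embeddings of dimension 64,
# standing in for embeddings learned by an expressive TTS model.
style_embeddings = rng.normal(size=(200, 64))

# Approach 1: compress the embeddings into a continuous 2-D space with
# t-SNE; points in this space can then be inspected or sampled. Note that
# t-SNE has no inverse mapping, which is one reason such an approach is
# tied to the specific database it was fitted on.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
style_2d = tsne.fit_transform(style_embeddings)
print(style_2d.shape)  # (200, 2)

# Approach 2: interpolate between a neutral and an emotional embedding;
# alpha acts as an emotion-strength knob (0 = neutral, 1 = full emotion).
def interpolate(neutral, emotional, alpha):
    """Linear interpolation between two embedding vectors."""
    return (1.0 - alpha) * neutral + alpha * emotional

neutral_emb = style_embeddings[0]   # assumed neutral-style embedding
happy_emb = style_embeddings[1]     # assumed emotional-style embedding
half_strength = interpolate(neutral_emb, happy_emb, 0.5)
print(half_strength[:4])  # first few dimensions of the blended embedding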