EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech

Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Sang-Hoon Lee and Seong-Whan Lee

Abstract

Despite rapid advances in the field of emotional text-to-speech (TTS), recent studies primarily focus on mimicking the average style of a particular emotion. As a result, the ability to manipulate speech emotion remains constrained to several predefined labels, compromising the ability to reflect the nuanced variations of emotion. In this paper, we propose EmoSphere-TTS, which synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech. Without any human annotation, we use the arousal, valence, and dominance pseudo-labels to model the complex nature of emotion via a Cartesian-spherical transformation. Furthermore, we propose a dual conditional adversarial network to improve the quality of generated speech by reflecting the multi-aspect characteristics. The experimental results demonstrate the model’s ability to control emotional style and intensity with high-quality expressive speech.
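The Cartesian-spherical transformation mentioned in the abstract can be illustrated with a minimal sketch. It assumes each utterance has an (arousal, valence, dominance) pseudo-label and that the coordinates are measured relative to a reference center, so that the radius can be read as intensity and the two angles as style. The function names, the axis assignment, and the choice of center are illustrative assumptions, not the authors' implementation.

```python
import math

def avd_to_spherical(point, center=(0.0, 0.0, 0.0)):
    """Map an (arousal, valence, dominance) point to (r, theta, phi).

    Assumption: `center` is a reference point (for example a neutral
    emotion center); the axis order is chosen only for illustration.
    """
    x, y, z = (p - c for p, c in zip(point, center))
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        # The angles are undefined at the center; return zeros by convention.
        return 0.0, 0.0, 0.0
    # Clamp to guard against tiny floating-point overshoot before acos.
    theta = math.acos(max(-1.0, min(1.0, z / r)))  # polar angle
    phi = math.atan2(y, x)                         # azimuth
    return r, theta, phi

def spherical_to_avd(r, theta, phi, center=(0.0, 0.0, 0.0)):
    """Inverse mapping: (r, theta, phi) back to (arousal, valence, dominance)."""
    x = r * math.sin(theta) * math.cos(phi)
    y = r * math.sin(theta) * math.sin(phi)
    z = r * math.cos(theta)
    return (x + center[0], y + center[1], z + center[2])
```

Under these assumptions, scaling r while keeping theta and phi fixed changes only the distance from the center, which is one way to picture separate control of intensity and style.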





Overall framework of EmoSphere-TTS

Emotional TTS Quality

Emotion : Neutral
Script : I chose the right way.

GT

BigVGAN

FastSpeech 2 w/ Emotion Label

FastSpeech 2 w/ Relative Attribute

FastSpeech 2 w/ Scaling Factor

EmoSphere-TTS (Proposed)

w/o Spherical Emotion Vector

w/o Dual Conditional Discriminator
Emotion : Angry
Script : She may mind ye of her.

GT

BigVGAN

FastSpeech 2 w/ Emotion Label

FastSpeech 2 w/ Relative Attribute

FastSpeech 2 w/ Scaling Factor

EmoSphere-TTS (Proposed)

w/o Spherical Emotion Vector

w/o Dual Conditional Discriminator
Emotion : Sad
Script : Story twenty nine a boy and a monkey.

GT

BigVGAN

FastSpeech 2 w/ Emotion Label

FastSpeech 2 w/ Relative Attribute

FastSpeech 2 w/ Scaling Factor

EmoSphere-TTS (Proposed)

w/o Spherical Emotion Vector

w/o Dual Conditional Discriminator
Emotion : Happy
Script : She is now choosing skirt to wear.

GT

BigVGAN

FastSpeech 2 w/ Emotion Label

FastSpeech 2 w/ Relative Attribute

FastSpeech 2 w/ Scaling Factor

EmoSphere-TTS (Proposed)

w/o Spherical Emotion Vector

w/o Dual Conditional Discriminator
Emotion : Surprise
Script : The nastiest things they saw were the cobwebs.

GT

BigVGAN

FastSpeech 2 w/ Emotion Label

FastSpeech 2 w/ Relative Attribute

FastSpeech 2 w/ Scaling Factor

EmoSphere-TTS (Proposed)

w/o Spherical Emotion Vector

w/o Dual Conditional Discriminator

Emotion Intensity Controllability

Emotion : Angry
Script : Our thanks to gods oath.

Relative Attribute (Weak)

Relative Attribute (Medium)

Relative Attribute (Strong)

Scaling Factor (Weak)

Scaling Factor (Medium)

Scaling Factor (Strong)

EmoSphere-TTS (Weak)

EmoSphere-TTS (Medium)

EmoSphere-TTS (Strong)
Emotion : Sad
Script : Our thanks to gods oath.

Relative Attribute (Weak)

Relative Attribute (Medium)

Relative Attribute (Strong)

Scaling Factor (Weak)

Scaling Factor (Medium)

Scaling Factor (Strong)

EmoSphere-TTS (Weak)

EmoSphere-TTS (Medium)

EmoSphere-TTS (Strong)
Emotion : Happy
Script : As rich as peters son in law!

Relative Attribute (Weak)

Relative Attribute (Medium)

Relative Attribute (Strong)

Scaling Factor (Weak)

Scaling Factor (Medium)

Scaling Factor (Strong)

EmoSphere-TTS (Weak)

EmoSphere-TTS (Medium)

EmoSphere-TTS (Strong)
Emotion : Surprise
Script : Rat came and replied on the leaves.

Relative Attribute (Weak)

Relative Attribute (Medium)

Relative Attribute (Strong)

Scaling Factor (Weak)

Scaling Factor (Medium)

Scaling Factor (Strong)

EmoSphere-TTS (Weak)

EmoSphere-TTS (Medium)

EmoSphere-TTS (Strong)

Emotional Style Shift

*Valence: positivity or negativity of emotion

*Arousal: the level of excitement or energy

*Dominance: control level within an emotional state
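The "Style" lines in the samples below give a sign for each of the three dimensions, before and after the shift. As a rough sketch of how to read that notation, the snippet treats each sign pattern as a direction in valence-arousal-dominance space and each intensity level as a vector length. The `LEVELS` radii and the `style_vector` helper are hypothetical and chosen only for illustration; this page does not state how the samples were actually generated.

```python
import math

# Hypothetical radii for the three intensity levels (illustrative values only).
LEVELS = {"weak": 0.3, "medium": 0.6, "strong": 0.9}

def style_vector(valence_sign, arousal_sign, dominance_sign, level):
    """Return a (valence, arousal, dominance) offset of length LEVELS[level]
    pointing along the diagonal of the octant selected by the three signs."""
    scale = LEVELS[level] / math.sqrt(3.0)
    return (valence_sign * scale, arousal_sign * scale, dominance_sign * scale)

# "-Valence, -Arousal, +Dominance → +Valence, +Arousal, -Dominance" at strong intensity
base = style_vector(-1, -1, +1, "strong")
shifted = style_vector(+1, +1, -1, "strong")
```

In this reading, a style shift changes the direction of the vector while the intensity level keeps its length unchanged.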

Emotion : Angry
Style : -Valence, -Arousal, +Dominance → +Valence, +Arousal, -Dominance
Script : Hold up my chin, slow and solid.

Base Style (Weak)

Base Style (Medium)

Base Style (Strong)

Style Shift (Weak)

Style Shift (Medium)

Style Shift (Strong)
Emotion : Angry
Style : +Valence, +Arousal, -Dominance → -Valence, -Arousal, -Dominance
Script : Let's make the noise a snake.

Base Style (Weak)

Base Style (Medium)

Base Style (Strong)

Style Shift (Weak)

Style Shift (Medium)

Style Shift (Strong)
Emotion : Sad
Style : -Valence, -Arousal, +Dominance → +Valence, +Arousal, -Dominance
Script : Then we all say aha!

Base Style (Weak)

Base Style (Medium)

Base Style (Strong)

Style Shift (Weak)

Style Shift (Medium)

Style Shift (Strong)
Emotion : Sad
Style : +Valence, +Arousal, +Dominance → +Valence, -Arousal, -Dominance
Script : I think it'll encourage me.

Base Style (Weak)

Base Style (Medium)

Base Style (Strong)

Style Shift (Weak)

Style Shift (Medium)

Style Shift (Strong)
Emotion : Happy
Style : +Valence, +Arousal, -Dominance → -Valence, -Arousal, +Dominance
Script : Take courage all isn't lost yet.

Base Style (Weak)

Base Style (Medium)

Base Style (Strong)

Style Shift (Weak)

Style Shift (Medium)

Style Shift (Strong)
Emotion : Happy
Style : +Valence, +Arousal, +Dominance → -Valence, -Arousal, -Dominance
Script : Take courage all isn't lost yet.

Base Style (Weak)

Base Style (Medium)

Base Style (Strong)

Style Shift (Weak)

Style Shift (Medium)

Style Shift (Strong)
Emotion : Surprise
Style : +Valence, +Arousal, +Dominance → +Valence, +Arousal, -Dominance
Script : Then we all say aha!

Base Style (Weak)

Base Style (Medium)

Base Style (Strong)

Style Shift (Weak)

Style Shift (Medium)

Style Shift (Strong)
Emotion : Surprise
Style : +Valence, +Arousal, +Dominance → +Valence, +Arousal, -Dominance
Script : As rich as peters son in law!

Base Style (Weak)

Base Style (Medium)

Base Style (Strong)

Style Shift (Weak)

Style Shift (Medium)

Style Shift (Strong)