Exploring emotional prototypes in a high dimensional TTS latent space

van Rijn, Pol; Mertes, Silvan; Schiller, Dominik; Harrison, Peter M. C.; Larrouy-Maestri, Pauline; André, Elisabeth; Jacoby, Nori

doi:10.21437/Interspeech.2021-1538

Conference Paper

MPS-Authors

van Rijn, Pol
Department of Neuroscience, Max Planck Institute for Empirical Aesthetics, Max Planck Society;

Harrison, Peter M. C.
Research Group Computational Auditory Perception, Max Planck Institute for Empirical Aesthetics, Max Planck Society;

Larrouy-Maestri, Pauline
Department of Neuroscience, Max Planck Institute for Empirical Aesthetics, Max Planck Society;
Max-Planck-NYU, Center for Language, Music, and Emotion;

Jacoby, Nori
Research Group Computational Auditory Perception, Max Planck Institute for Empirical Aesthetics, Max Planck Society;

Citation

van Rijn, P., Mertes, S., Schiller, D., Harrison, P. M. C., Larrouy-Maestri, P., André, E., et al. (2021). Exploring emotional prototypes in a high dimensional TTS latent space. In Proceedings Interspeech 2021 (pp. 3870-3874). Baixas: ISCA. doi:10.21437/Interspeech.2021-1538.

Cite as: https://hdl.handle.net/21.11116/0000-000A-E019-D

Abstract

Recent TTS systems are able to generate prosodically varied and realistic speech. However, it is unclear how this prosodic variation contributes to the perception of speakers’ emotional states. Here we use the recent psychological paradigm ‘Gibbs Sampling with People’ to search the prosodic latent space in a trained Global Style Token Tacotron model to explore prototypes of emotional prosody. Participants are recruited online and collectively manipulate the latent space of the generative speech model in a sequentially adaptive way so that the stimulus presented to one group of participants is determined by the response of the previous groups. We demonstrate that (1) particular regions of the model’s latent space are reliably associated with particular emotions, (2) the resulting emotional prototypes are well-recognized by a separate group of human raters, and (3) these emotional prototypes can be effectively transferred to new sentences. Collectively, these experiments demonstrate a novel approach to the understanding of emotional speech by providing a tool to explore the relation between the latent space of generative models and human semantics.
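The sequentially adaptive procedure described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' implementation: it assumes a small latent vector, replaces human participants with a hypothetical `simulated_rater` that prefers values near a hidden prototype, updates one dimension per iteration, and aggregates each group's slider choices with a median. All names, the noise level, the slider grid, and the aggregation rule are assumptions made for illustration only.

```python
import random
import statistics


def simulated_rater(candidates, target, dim, noise_sd, rng):
    """Hypothetical stand-in for a human participant.

    Picks the slider position that lies closest to a hidden prototype value
    on the current dimension, with Gaussian noise added to each judgment.
    """
    def score(value):
        return -abs(value - target[dim]) + rng.gauss(0.0, noise_sd)

    return max(candidates, key=score)


def run_chain(n_dims, n_iterations, raters_per_iteration, target, seed=0):
    """Run one chain: each iteration changes a single latent dimension.

    The state shown to one group of raters is the state produced by the
    responses of the previous groups.
    """
    rng = random.Random(seed)
    # Random starting point in the (assumed) latent range [-1, 1].
    state = [rng.uniform(-1.0, 1.0) for _ in range(n_dims)]
    # 21 evenly spaced slider positions covering [-1, 1].
    slider = [i / 10 - 1.0 for i in range(21)]
    history = [list(state)]
    for iteration in range(n_iterations):
        dim = iteration % n_dims  # cycle through the dimensions in order
        choices = [
            simulated_rater(slider, target, dim, 0.05, rng)
            for _ in range(raters_per_iteration)
        ]
        # Assumed aggregation rule: median of the group's slider choices.
        state[dim] = statistics.median(choices)
        history.append(list(state))
    return history


if __name__ == "__main__":
    prototype = [0.5, -0.3, 0.8]  # hidden "emotional prototype" (made up)
    chain = run_chain(3, 30, 5, prototype)
    print("start:", [round(v, 2) for v in chain[0]])
    print("end:  ", [round(v, 2) for v in chain[-1]])
```

In a study like the one described, the simulated rater would be replaced by human listeners judging speech synthesized from the current latent vector, and the state would be the prosodic latent representation of the Global Style Token Tacotron model rather than a three-dimensional toy vector.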