
Released

Conference Paper

WavThruVec: Latent speech representation as intermediate features for neural speech synthesis

MPS-Authors

van Rijn, Pol
Department of Neuroscience, Max Planck Institute for Empirical Aesthetics, Max Planck Society

Jacoby, Nori
Research Group Computational Auditory Perception, Max Planck Institute for Empirical Aesthetics, Max Planck Society

Citation

Siuzdak, H., Dura, P., van Rijn, P., & Jacoby, N. (2022). WavThruVec: Latent speech representation as intermediate features for neural speech synthesis. In Proceedings Interspeech 2022 (pp. 833-837). doi:10.21437/Interspeech.2022-10797.


Cite as: https://hdl.handle.net/21.11116/0000-000C-D87C-6
Abstract
Recent advances in neural text-to-speech research have been dominated by two-stage pipelines that use a low-level intermediate speech representation such as mel-spectrograms. However, such predetermined features are fundamentally limited, because they do not allow the full potential of a data-driven approach to be exploited through learning hidden representations. For this reason, several end-to-end methods have been proposed. However, such models are harder to train and require a large number of high-quality recordings with transcriptions. Here, we propose WavThruVec, a two-stage architecture that resolves the bottleneck by using high-dimensional wav2vec 2.0 embeddings as the intermediate speech representation. Since these hidden activations provide high-level linguistic features, they are more robust to noise, which allows us to use annotated speech datasets of lower quality to train the first-stage module. At the same time, the second-stage component can be trained on large-scale untranscribed audio corpora, because wav2vec 2.0 embeddings are already time-aligned. This results in better generalization both to out-of-vocabulary words and to unseen speakers. We show that the proposed model not only matches the quality of state-of-the-art neural models, but also exhibits useful properties that enable tasks such as voice conversion and zero-shot synthesis.
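
The two-stage factorization the abstract describes can be illustrated with a short sketch. The snippet below extracts wav2vec 2.0 hidden activations with the Hugging Face transformers library (a real API) and wires them through two placeholder modules, FirstStage and SecondStage, standing in for the paper's text-to-embedding and embedding-to-waveform networks. The module names, layer choices, dimensions, and model checkpoint are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a WavThruVec-style two-stage pipeline.
# The wav2vec 2.0 feature extraction uses the real Hugging Face
# `transformers` API; FirstStage/SecondStage are hypothetical
# stand-ins for the networks described in the paper.
import numpy as np
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# --- Intermediate representation: wav2vec 2.0 hidden activations ---
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

waveform = np.random.randn(16000).astype(np.float32)  # stand-in for 1 s of 16 kHz audio
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    # (batch, frames, 768): time-aligned embeddings, one vector per ~20 ms frame
    target_embeddings = w2v(inputs.input_values).last_hidden_state

# --- Hypothetical stage 1: text -> wav2vec embedding sequence ---
class FirstStage(nn.Module):
    def __init__(self, vocab_size=256, dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):
        # A real model would also predict durations to upsample the
        # token sequence to the ~50 Hz frame rate of wav2vec 2.0.
        return self.proj(self.embed(token_ids))

# --- Hypothetical stage 2: embedding sequence -> waveform ---
class SecondStage(nn.Module):
    def __init__(self, dim=768, hop=320):
        super().__init__()
        self.to_audio = nn.Linear(dim, hop)  # hop samples per frame

    def forward(self, embeddings):
        return self.to_audio(embeddings).flatten(start_dim=1)

text_to_vec, vec_to_wav = FirstStage(), SecondStage()
tokens = torch.randint(0, 256, (1, 50))      # dummy "transcription"
predicted = text_to_vec(tokens)              # trainable on noisy transcribed speech
audio = vec_to_wav(target_embeddings)        # trainable on untranscribed audio alone
print(predicted.shape, audio.shape)          # (1, 50, 768), (1, frames * 320)
```

The training split follows directly from this factorization: the first stage learns to predict embeddings from text using transcribed (possibly noisy) speech, while the second stage needs no transcriptions at all, since its targets can be computed from the waveform itself. Voice conversion likewise falls out of the design: embeddings extracted from one speaker's recording can, presumably with appropriate speaker conditioning in the second stage, be resynthesized in another voice.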