WavThruVec: Latent speech representation as intermediate features for neural speech synthesis

Siuzdak, H., Dura, P., van Rijn, P., & Jacoby, N. (2022). WavThruVec: Latent speech representation as intermediate features for neural speech synthesis. In Proceedings Interspeech 2022 (pp. 833-837). doi:10.21437/Interspeech.2022-10797.

Basic

Genre: Conference Paper

Creators

Creators:
Siuzdak, Hubert (1), Author
Dura, Piotr (1), Author
van Rijn, Pol (2), Author
Jacoby, Nori (3), Author
Affiliations:
(1) Charactr, Inc.
(2) Department of Neuroscience, Max Planck Institute for Empirical Aesthetics, Max Planck Society
(3) Research Group Computational Auditory Perception, Max Planck Institute for Empirical Aesthetics, Max Planck Society

Content

Free keywords: text-to-speech, intermediate speech representation, end-to-end learning, voice conversion, zero-shot synthesis
Abstract: Recent advances in neural text-to-speech research have been dominated by two-stage pipelines that use a low-level intermediate speech representation such as mel-spectrograms. However, such predetermined features are fundamentally limited, because they prevent a data-driven approach from reaching its full potential through learned hidden representations. For this reason, several end-to-end methods have been proposed. However, such models are harder to train and require a large number of high-quality recordings with transcriptions. Here, we propose WavThruVec, a two-stage architecture that resolves this bottleneck by using high-dimensional wav2vec 2.0 embeddings as the intermediate speech representation. Since these hidden activations provide high-level linguistic features, they are more robust to noise, which allows us to train the first-stage module on annotated speech datasets of lower quality. At the same time, the second-stage component can be trained on large-scale untranscribed audio corpora, because wav2vec 2.0 embeddings are already time-aligned. This results in better generalization both to out-of-vocabulary words and to unseen speakers. We show that the proposed model not only matches the quality of state-of-the-art neural models, but also exhibits useful properties that enable tasks such as voice conversion and zero-shot synthesis.
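
The pipeline the abstract describes hinges on extracting time-aligned wav2vec 2.0 hidden activations to serve as the interface between the two stages. As a rough illustration, the following is a minimal Python sketch of such feature extraction using the Hugging Face transformers library; the library choice, checkpoint, and layer are assumptions made here for illustration, not details taken from the paper.

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Load a pretrained wav2vec 2.0 encoder (this checkpoint is an assumed example).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

# Placeholder input: one second of 16 kHz noise; substitute real speech audio.
waveform = torch.randn(16000).numpy()
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(inputs.input_values)

# A time-aligned sequence of high-dimensional frames, shape (1, num_frames, 768).
# In a WavThruVec-style setup, the first stage would predict frames like these
# from text, and the second stage would reconstruct the waveform from them.
features = outputs.last_hidden_state
print(features.shape)

Because such frames are produced at a fixed rate from raw audio alone, a second-stage vocoder can be trained on untranscribed corpora, which is the property the abstract highlights.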

Details

Language(s): eng - English
Dates: 2022
Publication Status: Published online
Identifiers: DOI: 10.21437/Interspeech.2022-10797

Event

Title: Interspeech 2022
Place of Event: Incheon, Korea
Start-/End Date: 2022-09-18 - 2022-09-22

Source 1

Title: Proceedings Interspeech 2022
Source Genre: Proceedings
Start / End Page: 833 - 837
Identifier: ISSN: 2308-457X