
Item Details


Released

Conference Paper

WavThruVec: Latent speech representation as intermediate features for neural speech synthesis

MPS-Authors
/persons/resource/persons255681

van Rijn, Pol
Department of Neuroscience, Max Planck Institute for Empirical Aesthetics, Max Planck Society;

/persons/resource/persons242173

Jacoby, Nori
Research Group Computational Auditory Perception, Max Planck Institute for Empirical Aesthetics, Max Planck Society;

External Resource
There are no locators available
Fulltext (restricted access)
There are currently no full texts shared for your IP range.
Fulltext (public)
There are no public fulltexts available
Supplementary Material (public)
There is no public supplementary material available
Citation

Siuzdak, H., Dura, P., van Rijn, P., & Jacoby, N. (2022). WavThruVec: Latent speech representation as intermediate features for neural speech synthesis. In Proceedings Interspeech 2022 (pp. 833-837). doi:10.21437/Interspeech.2022-10797.


Cite as: https://hdl.handle.net/21.11116/0000-000C-D87C-6
Abstract
Recent advances in neural text-to-speech research have been dominated by two-stage pipelines that use a low-level intermediate speech representation such as mel-spectrograms. However, such predetermined features are fundamentally limited, because they do not allow the full potential of a data-driven approach to be exploited through learned hidden representations. For this reason, several end-to-end methods have been proposed. However, such models are harder to train and require a large number of high-quality recordings with transcriptions. Here, we propose WavThruVec, a two-stage architecture that resolves this bottleneck by using high-dimensional wav2vec 2.0 embeddings as the intermediate speech representation. Since these hidden activations provide high-level linguistic features, they are more robust to noise. This allows us to train the first-stage module on annotated speech datasets of lower quality. At the same time, the second-stage component can be trained on large-scale untranscribed audio corpora, as wav2vec 2.0 embeddings are already time-aligned. This results in increased generalization to out-of-vocabulary words, as well as better generalization to unseen speakers. We show that the proposed model not only matches the quality of state-of-the-art neural models, but also exhibits useful properties that enable tasks such as voice conversion and zero-shot synthesis.
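The architecture described in the abstract meets at the level of wav2vec 2.0 hidden activations: the first stage predicts these embeddings from text, and the second stage turns them back into a waveform. The sketch below only illustrates how such time-aligned, high-dimensional embeddings can be extracted from audio with a publicly available wav2vec 2.0 checkpoint via torchaudio; the specific checkpoint, layer choice, and file name are assumptions for illustration, not details taken from the paper.

```python
import torch
import torchaudio

# Assumption: a generic public wav2vec 2.0 base checkpoint; the paper does not
# commit to this particular model or layer here.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

# Load a mono utterance (hypothetical file name) and resample to the model's rate.
waveform, sr = torchaudio.load("utterance.wav")
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.no_grad():
    # Hidden activations from each transformer layer: a sequence of frame-level
    # vectors that is already time-aligned with the audio, which is the kind of
    # intermediate representation the two stages are joined at.
    features, _ = model.extract_features(waveform)

print(features[-1].shape)  # (batch, frames, 768) for the base model
```

In such a setup, the first-stage text-to-embedding model would be trained against targets like `features`, while the second-stage decoder can be trained on untranscribed audio alone, since the targets and the waveform share the same time axis.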