Item Details

  WavThruVec: Latent speech representation as intermediate features for neural speech synthesis

Siuzdak, H., Dura, P., van Rijn, P., & Jacoby, N. (2022). WavThruVec: Latent speech representation as intermediate features for neural speech synthesis. In Proceedings Interspeech 2022 (pp. 833-837). doi:10.21437/Interspeech.2022-10797.

Basic Information

Item Permalink: https://hdl.handle.net/21.11116/0000-000C-D87C-6
Version Permalink: https://hdl.handle.net/21.11116/0000-000C-D87D-5
Genre: Conference Paper

Creators

Creators:
Siuzdak, Hubert1, Author
Dura, Piotr1, Author
van Rijn, Pol2, Author
Jacoby, Nori3, Author
Affiliations:
1Charactr, Inc, ou_persistent22
2Department of Neuroscience, Max Planck Institute for Empirical Aesthetics, Max Planck Society, ou_2421697
3Research Group Computational Auditory Perception, Max Planck Institute for Empirical Aesthetics, Max Planck Society, ou_3024247

Content

Keywords: text-to-speech, intermediate speech representation, end-to-end learning, voice conversion, zero-shot synthesis
Abstract: Recent advances in neural text-to-speech research have been dominated by two-stage pipelines that use low-level intermediate speech representations such as mel-spectrograms. However, such predetermined features are fundamentally limited, because they do not allow the full potential of a data-driven approach to be exploited through learned hidden representations. For this reason, several end-to-end methods have been proposed. However, such models are harder to train and require a large number of high-quality recordings with transcriptions. Here, we propose WavThruVec, a two-stage architecture that resolves this bottleneck by using high-dimensional wav2vec 2.0 embeddings as the intermediate speech representation. Since these hidden activations provide high-level linguistic features, they are more robust to noise. That allows us to use annotated speech datasets of lower quality to train the first-stage module. At the same time, the second-stage component can be trained on large-scale untranscribed audio corpora, as wav2vec 2.0 embeddings are already time-aligned. This results in increased generalization to out-of-vocabulary words, as well as better generalization to unseen speakers. We show that the proposed model not only matches the quality of state-of-the-art neural models, but also exhibits useful properties that enable tasks like voice conversion and zero-shot synthesis.
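
The design described in the abstract (frame-level wav2vec 2.0 hidden activations serving as the interface between a text-to-embedding first stage and an embedding-to-waveform second stage) can be illustrated with a minimal sketch. This is not the authors' implementation; the pretrained checkpoint name and the use of the Hugging Face transformers API are illustrative assumptions, not details taken from the paper.

import torch
from transformers import Wav2Vec2Model

# Illustrative checkpoint; the paper does not tie itself to this exact model.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

# One second of 16 kHz audio (random here; a real recording in practice).
waveform = torch.randn(1, 16000)

with torch.no_grad():
    # Hidden activations of shape (batch, frames, 768): one 768-dim vector
    # per ~20 ms frame, so the representation is inherently time-aligned
    # with the waveform.
    features = model(waveform).last_hidden_state

print(features.shape)  # e.g. torch.Size([1, 49, 768])

In a WavThruVec-style pipeline, the first-stage model would be trained to predict such frame-level embeddings from text (tolerating noisier transcribed speech), while the second-stage vocoder would learn to reconstruct waveforms from the embeddings using large untranscribed corpora.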

Details

Language(s): eng - English
Dates: 2022
Publication Status: Published online
Pages: -
Publishing Info: -
Table of Contents: -
Review Method: -
Identifiers (DOI, ISBN, etc.): DOI: 10.21437/Interspeech.2022-10797
Degree: -

Related Event

Event Name: Interspeech 2022
Place of Event: Incheon, Korea
Start / End Date: 2022-09-18 - 2022-09-22

Publication 1

Title: Proceedings Interspeech 2022
Genre: Conference Proceedings
Authors / Editors: -
Affiliations: -
Publisher, Place: -
Pages: -
Volume / Issue: -
Sequence Number: -
Start / End Page: 833 - 837
Identifiers (ISBN, ISSN, DOI, etc.): ISSN: 2308-457X