Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module

Gabryś, Adam; Huybrechts, Goeric; Ribeiro, Manuel Sam; Chien, Chung-Ming; Roth, Julian; Comini, Giulia; Barra-Chicote, Roberto; Perz, Bartek; Lorenzo-Trueba, Jaime

arXiv:2202.08164 (eess)

[Submitted on 16 Feb 2022]

Abstract: State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech. When using reduced amounts of training data, standard TTS models suffer from speech quality and intelligibility degradations, making training low-resource TTS systems problematic. In this paper, we propose a novel extremely low-resource TTS method called Voice Filter that uses as little as one minute of speech from a target speaker. It uses voice conversion (VC) as a post-processing module appended to a pre-existing high-quality TTS system and marks a conceptual shift in the existing TTS paradigm, framing the few-shot TTS problem as a VC task. Furthermore, we propose to use a duration-controllable TTS system to create a parallel speech corpus to facilitate the VC task. Results show that the Voice Filter outperforms state-of-the-art few-shot speech synthesis techniques in terms of objective and subjective metrics on one minute of speech on a diverse set of voices, while being competitive against a TTS model built on 30 times more data.
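The two ideas in the abstract (a VC module appended after an existing TTS system, and a duration-controllable TTS used to build a frame-aligned parallel corpus) can be illustrated with a toy sketch. This is not the authors' implementation: `toy_tts`, `Utterance`, `build_parallel_corpus`, and `synthesize` are hypothetical names, the "frames" are one-dimensional stand-ins for acoustic features, and per-phoneme durations for the target recordings are assumed to be available (for example from forced alignment).

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Toy stand-in for acoustic features: one feature vector per frame.
Frames = List[List[float]]


@dataclass
class Utterance:
    """One recording of the target speaker (hypothetical container)."""
    phonemes: List[str]
    durations: List[int]  # frames per phoneme, assumed known
    frames: Frames        # acoustic frames of the recording


def toy_tts(phonemes: Sequence[str], durations: Sequence[int]) -> Frames:
    """Hypothetical duration-controllable TTS.

    Emits exactly durations[i] frames for phoneme i, so the output
    length can be forced to match a reference recording.
    """
    if len(phonemes) != len(durations):
        raise ValueError("need one duration per phoneme")
    frames: Frames = []
    for phoneme, n_frames in zip(phonemes, durations):
        frames.extend([float(ord(phoneme[0]))] for _ in range(n_frames))
    return frames


def build_parallel_corpus(
    target_utts: Sequence[Utterance],
    tts: Callable[[Sequence[str], Sequence[int]], Frames],
) -> List[Tuple[Frames, Frames]]:
    """Pair each target recording with TTS output of the same text and
    durations, giving frame-aligned (source, target) pairs for VC training."""
    pairs = []
    for utt in target_utts:
        source = tts(utt.phonemes, utt.durations)
        if len(source) != len(utt.frames):
            raise ValueError("source and target are not frame-aligned")
        pairs.append((source, utt.frames))
    return pairs


def synthesize(
    phonemes: Sequence[str],
    durations: Sequence[int],
    tts: Callable[[Sequence[str], Sequence[int]], Frames],
    voice_filter: Callable[[Frames], Frames],
) -> Frames:
    """Inference: run the existing TTS, then apply VC as post-processing."""
    return voice_filter(tts(phonemes, durations))
```

In this sketch the voice filter is any callable from frames to frames; in practice it would be a trained VC model, and the pairs returned by `build_parallel_corpus` would serve as its training data.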

Comments:	Accepted at ICASSP 2022
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2202.08164 [eess.AS]
	(or arXiv:2202.08164v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2202.08164

Submission history

From: Adam Gabryś
[v1] Wed, 16 Feb 2022 16:12:21 UTC (113 KB)
