Authors: Nabarun Goswami (1), Tatsuya Harada (1,2)
Affiliations: (1) The University of Tokyo, (2) RIKEN
Accepted to Interspeech 2022
Abstract: The mapping of text to speech (TTS) is non-deterministic: letters may be pronounced differently depending on context, and phonemes can vary with physiological and stylistic factors such as gender, age, accent, and emotion. Neural speaker embeddings, trained to identify or verify speakers, are typically used to represent and transfer such characteristics from reference speech to synthesized speech. Speech separation, on the other hand, is the challenging task of separating individual speakers from an overlapping mixture of several speakers. Speaker attractors are high-dimensional embedding vectors that pull the time-frequency bins of each speaker's speech towards themselves while repelling those belonging to other speakers. In this work, we explore the possibility of using these powerful speaker attractors for zero-shot speaker adaptation in multi-speaker TTS synthesis and propose speaker attractor text to speech (SATTS). Through various experiments, we show that SATTS can synthesize natural speech from text using an unseen target speaker's reference signal, which might have less than ideal recording conditions, i.e., reverberation or mixing with other speakers.
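As a rough illustration of the attractor idea described in the abstract, the sketch below assigns toy time-frequency embeddings to speakers with a softmax over their dot-product similarity to each attractor, in the style of deep-attractor separation models. The function name `soft_masks`, the 2-D toy vectors, and the softmax assignment are illustrative assumptions and are not taken from the SATTS implementation.

```python
import math

def soft_masks(embeddings, attractors):
    """Softly assign each time-frequency bin to speakers.

    embeddings: list of D-dimensional vectors, one per time-frequency bin.
    attractors: list of D-dimensional vectors, one per speaker.
    Returns, for every bin, a list of per-speaker weights that sum to 1.
    """
    masks = []
    for emb in embeddings:
        # Dot-product similarity between this bin and every attractor.
        scores = [sum(e * a for e, a in zip(emb, att)) for att in attractors]
        # Numerically stable softmax over the speaker axis.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        masks.append([v / total for v in exps])
    return masks

# Toy example (hypothetical values): two speakers, three bins.
attractors = [[1.0, 0.0], [0.0, 1.0]]
embeddings = [[2.0, 0.1], [0.2, 3.0], [1.0, 1.0]]
masks = soft_masks(embeddings, attractors)
```

In this toy case the first bin leans toward speaker 0, the second toward speaker 1, and the third is split evenly, which is the "pull toward the own attractor" behaviour in miniature.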
The speaker attractor/embedding is extracted from the Mixed-Reference sample.
These examples are sampled from the evaluation set for Table 3 in the paper.
Target Speaker ID | Target Clean Reference | Mixed-Reference | SATTS |
---|---|---|---|
LibriTTS 908 | | | |
LibriTTS 1580 | | | |
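For readers who want to build a similar Mixed-Reference condition themselves, the sketch below adds an interfering speaker to a target reference at a chosen signal-to-interference ratio. The function name `mix_at_sir`, the sample values, and the use of a fixed ratio in dB are assumptions for illustration; the mixing procedure actually used for the samples above is described in the paper, not here. Real audio would normally be handled with an array library rather than plain lists.

```python
import math

def mix_at_sir(target, interferer, sir_db):
    """Mix two signals so the target-to-interferer power ratio is sir_db (dB)."""
    n = min(len(target), len(interferer))
    if n == 0:
        raise ValueError("both signals must be non-empty")
    target, interferer = target[:n], interferer[:n]
    power_t = sum(x * x for x in target) / n
    power_i = sum(x * x for x in interferer) / n
    if power_i == 0:
        raise ValueError("interferer is silent")
    # Choose gain so that 10*log10(power_t / (gain**2 * power_i)) == sir_db.
    gain = math.sqrt(power_t / (power_i * 10 ** (sir_db / 10)))
    return [t + gain * i for t, i in zip(target, interferer)]

# Toy signals (hypothetical values), mixed at 0 dB, i.e. equal power.
target = [1.0, -1.0, 1.0, -1.0]
interferer = [2.0, 2.0, -2.0, -2.0]
mixed = mix_at_sir(target, interferer, sir_db=0.0)
```

At 0 dB the interferer is scaled by 0.5 here, so both signals contribute equal power to the mixture.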
The speaker attractor/embedding is extracted from the RIR-Reference sample.
These examples are sampled from the evaluation set for Table 2 in the paper.
Speaker ID | Clean Reference | RIR-Reference | SATTS | SV2TTS |
---|---|---|---|---|
LibriTTS 237 | | | | |
LibriTTS 1580 | | | | |
VCTK 234 | | | | |
VCTK 245 | | | | |
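A reverberant reference of this kind is commonly simulated by convolving a clean recording with a room impulse response (RIR). The sketch below shows that convolution on toy data; the function name `reverberate` and the three-tap impulse response are hypothetical, and the RIRs used for the samples above are not specified on this page.

```python
def reverberate(clean, rir):
    """Full discrete convolution of a clean signal with a room impulse response."""
    if not clean or not rir:
        raise ValueError("both inputs must be non-empty")
    out = [0.0] * (len(clean) + len(rir) - 1)
    for i, x in enumerate(clean):
        for j, h in enumerate(rir):
            # Each input sample excites a scaled, delayed copy of the RIR.
            out[i + j] += x * h
    return out

# Toy data (hypothetical values): direct path plus one echo two samples later.
clean = [1.0, 0.0, 0.0, 0.5]
rir = [1.0, 0.0, 0.5]
wet = reverberate(clean, rir)
```

The output is longer than the input by `len(rir) - 1` samples because the echo tail extends past the end of the clean signal. For real recordings, an FFT-based convolution would be the more practical choice.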
Each column corresponds to a single speaker. The speaker name is in "Dataset SpeakerID" format. The first row is the reference audio used to compute the speaker attractor.
These examples are sampled from the evaluation set for Table 1 in the paper.
Speaker | VCTK p347 | VCTK p261 | LibriTTS 1188 | LibriTTS 2300 |
---|---|---|---|---|
Reference: | | | | |
Synthesized (SATTS): | | | | |
License for LibriTTS dataset: https://creativecommons.org/licenses/by/4.0/
License for VCTK dataset: https://opendatacommons.org/licenses/by/1-0/