Audio samples from "SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate"

Authors: ¹Nabarun Goswami, ¹,²Tatsuya Harada

Affiliations: ¹The University of Tokyo, ²RIKEN

Accepted to Interspeech 2022


Abstract: The mapping of text to speech (TTS) is non-deterministic: letters may be pronounced differently depending on context, and phonemes can vary with physiological and stylistic factors such as gender, age, accent, and emotion. Neural speaker embeddings, trained to identify or verify speakers, are typically used to represent and transfer such characteristics from reference speech to synthesized speech. Speech separation, on the other hand, is the challenging task of separating individual speakers from a mixed signal in which they overlap. Speaker attractors are high-dimensional embedding vectors that pull the time-frequency bins of each speaker's speech towards themselves while repelling those belonging to other speakers. In this work, we explore the use of these powerful speaker attractors for zero-shot speaker adaptation in multi-speaker TTS synthesis and propose speaker attractor text to speech (SATTS). Through various experiments, we show that SATTS can synthesize natural speech from text given an unseen target speaker's reference signal, even under less than ideal recording conditions, i.e., with reverberation or mixed with other speakers.
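To make the speaker-attractor idea in the abstract concrete, here is a minimal NumPy sketch of the deep-attractor-style computation it alludes to: attractors are mask-weighted means of time-frequency embeddings, and bins are then assigned by similarity to each attractor. The shapes and helper names (V, Y, compute_attractors, soft_masks) are illustrative assumptions, not the paper's SANet implementation.

```python
import numpy as np

def compute_attractors(V, Y):
    """V: (TF, D) embeddings of time-frequency bins.
    Y: (TF, C) per-speaker assignment masks (e.g., ideal binary masks).
    Returns (C, D): each attractor is the mask-weighted mean embedding."""
    return (Y.T @ V) / (Y.sum(axis=0)[:, None] + 1e-8)

def soft_masks(V, A):
    """Reassign bins by similarity to each attractor: the softmax pulls each
    bin toward its nearest attractor and pushes it away from the others."""
    logits = V @ A.T  # (TF, C) bin-to-attractor similarities
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```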

[Figure: SATTS architecture]
[Figure: SANet architecture]

Speaker Adaptation for Unseen Speakers with Reference Speech Mixed with Another Speaker

The speaker attractor/embedding is extracted from the Mixed-Reference sample; a sketch of how such a mixture can be constructed follows the table below.

These examples are sampled from the evaluation set for Table 3 in the paper.

Target Speaker ID | Target Clean Reference | Mixed-Reference | SATTS
LibriTTS 908 | [audio] | [audio] | [audio]
LibriTTS 1580 | [audio] | [audio] | [audio]
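As a hedged sketch of how a Mixed-Reference could be constructed (not necessarily the paper's exact protocol), the following mixes a target reference with an interfering speaker at a chosen signal-to-interference ratio; mix_at_snr and its parameters are hypothetical.

```python
import numpy as np

def mix_at_snr(target, interferer, snr_db):
    """Mix two waveforms, scaling the interferer so the target-to-interferer
    power ratio equals snr_db."""
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]
    p_t = np.mean(target ** 2)
    p_i = np.mean(interferer ** 2) + 1e-12
    scale = np.sqrt(p_t / (p_i * 10 ** (snr_db / 10)))
    return target + scale * interferer
```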

Speaker Adaptation for Unseen Speakers with Random Room Impulse Responses

The speaker attractor/embedding is extracted from the RIR-Reference sample; a sketch of applying an RIR to a clean reference follows the table below.

These examples are sampled from the evaluation set for Table 2 in the paper.

Speaker ID | Clean Reference | RIR-Reference | SATTS | SV2TTS
LibriTTS 237 | [audio] | [audio] | [audio] | [audio]
LibriTTS 1580 | [audio] | [audio] | [audio] | [audio]
VCTK 234 | [audio] | [audio] | [audio] | [audio]
VCTK 245 | [audio] | [audio] | [audio] | [audio]
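An RIR-Reference can be produced by convolving clean speech with a room impulse response. Below is one common way to do this with SciPy, shown as an illustration rather than the paper's exact setup; apply_rir and the peak renormalization are assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_rir(clean, rir):
    """Convolve speech with a room impulse response, then rescale the
    reverberant signal back to the clean signal's peak level."""
    reverberant = fftconvolve(clean, rir, mode="full")[: len(clean)]
    peak = np.max(np.abs(reverberant)) + 1e-12
    return reverberant * (np.max(np.abs(clean)) / peak)
```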

Speaker Adaptation for Unseen Speakers under Clean Recording Conditions

Each column corresponds to a single speaker. The speaker name is in "Dataset SpeakerID" format. The first row is the reference audio used to compute the speaker attractor.

These examples are sampled from the evaluation set for Table 1 in the paper.

Speaker | VCTK p347 | VCTK p261 | LibriTTS 1188 | LibriTTS 2300
Reference | [audio] | [audio] | [audio] | [audio]
Synthesized (SATTS) | [audio] | [audio] | [audio] | [audio]
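Zero-shot adaptation is often judged by how close the synthesized speech is to the reference in a speaker-embedding space. As a generic illustration (not the paper's evaluation code), embeddings from any speaker encoder can be compared with cosine similarity:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embeddings; higher means the
    synthesized speech is closer to the reference speaker."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```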

License for LibriTTS dataset: https://creativecommons.org/licenses/by/4.0/

License for VCTK dataset: https://opendatacommons.org/licenses/by/1-0/