
On Zero-Shot Multi-Speaker Text-to-Speech Using Deep Learning


Kandarkar, Pradnya (2023) On Zero-Shot Multi-Speaker Text-to-Speech Using Deep Learning. Masters thesis, Concordia University.

Kandarkar_MCompSc_F2023.pdf - Accepted Version
Available under License Spectrum Terms of Access.


This thesis explores various aspects of zero-shot multi-speaker text-to-speech (TTS) synthesis using deep learning to build an effective system. A deep learning model for zero-shot multi-speaker TTS takes text and a speaker identity as input and generates the corresponding speech without fine-tuning for speakers not seen during training. The experiments consider a system with three main components: a speaker encoder network, a mel-spectrogram prediction network, and a vocoder network. The speaker encoder captures the speaker identity in a fixed-size speaker embedding. This embedding is injected into the mel-spectrogram prediction network at one or more locations to generate a mel-spectrogram conditioned on both the text and the speaker embedding. Finally, the vocoder converts the mel-spectrogram into a waveform. All three components are trained separately. The aspects explored in the experiments include the speaker embedding injection method, the speaker encoder network, the speaker embedding injection location, and the mel-spectrogram prediction network. The FiLM method from the visual reasoning field is adapted for the first time to inject speaker embeddings into the TTS workflow and is compared against traditional methods. The significance of speaker embeddings is highlighted by comparing two well-established speaker embedding models. New combinations of speaker embedding injection locations are explored for two mel-spectrogram prediction networks. The best-performing model generates speech with naturalness ranging from fair to good, exhibits better-than-moderate speaker similarity, and shows potential for improvement. Additionally, the zero-shot multi-speaker TTS system is extended to generate fictitious voices.
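The FiLM conditioning described in the abstract scales and shifts each feature channel of the prediction network using values predicted from the conditioning input (here, a speaker embedding). The following minimal NumPy sketch illustrates that mechanism in isolation; the dimensions, random projections, and variable names are illustrative assumptions, not the thesis's actual architecture.

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise Linear Modulation: per-channel affine conditioning.

    features: (time, channels) intermediate activations
    gamma, beta: (channels,) scale and shift predicted from the
    conditioning input (a speaker embedding in this setting).
    """
    return gamma * features + beta  # broadcasts over the time axis

rng = np.random.default_rng(0)
T, C, E = 50, 80, 256  # frames, feature channels, embedding size (assumed)

# Hypothetical speaker embedding, e.g. the output of a speaker encoder.
speaker_embedding = rng.normal(size=E)

# Hypothetical learned projections mapping the embedding to FiLM parameters.
W_gamma = rng.normal(size=(E, C)) * 0.01
W_beta = rng.normal(size=(E, C)) * 0.01

# Initialize gamma near 1 so conditioning starts close to identity.
gamma = 1.0 + speaker_embedding @ W_gamma
beta = speaker_embedding @ W_beta

hidden = rng.normal(size=(T, C))          # intermediate TTS features
conditioned = film(hidden, gamma, beta)   # same shape, now speaker-dependent
print(conditioned.shape)                  # (50, 80)
```

In a full system, the projections producing `gamma` and `beta` would be trained jointly with the mel-spectrogram prediction network, and the FiLM operation could be applied at one or more injection locations, as the abstract describes.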

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Kandarkar, Pradnya
Institution: Concordia University
Degree Name: M. Comp. Sc.
Program: Computer Science
Date: 28 July 2023
Thesis Supervisor(s): Ravanelli, Mirco
ID Code: 992632
Deposited By: Pradnya Kandarkar
Deposited On: 14 Nov 2023 20:34
Last Modified: 14 Nov 2023 20:34
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.


