
Visual Dubbing Pipeline using Two-Pass Identity Transfer

Patel, Dhyey Devendrakumar (2022) Visual Dubbing Pipeline using Two-Pass Identity Transfer. Masters thesis, Concordia University.

Patel_MSc_F2022.pdf - Accepted Version
Available under License Spectrum Terms of Access.


Visual dubbing uses visual computing and deep learning to alter the lip and mouth articulations of an actor to sync with the dubbed speech. It has the potential to disrupt the dubbing industry, where the quality of the dubbed result is of primary importance. An important requirement is that visual lip-sync changes be localized to the mouth region and not affect the rest of the actor's face or the rest of the video frame. Current methods can create realistic-looking fake faces with expressions. However, many fail to localize lip sync and exhibit quality problems such as identity loss, low resolution, blurring, loss of facial skin features or colour, and temporal jitter. These problems arise mainly because end-to-end trained networks poorly disentangle the different visual dubbing parameters (pose, skin colour, identity, lip movements, etc.). Our main contribution is a new visual dubbing pipeline in which, instead of end-to-end training, we incrementally apply a different disentangling technique for each parameter. Our pipeline is composed of three main steps: pose alignment, identity transfer, and video reassembly. Expert models in each step are fine-tuned for the actor. We propose an identity transfer network with an added style block, which with pre-training is able to decouple face components, specifically identity and expression, and which also works with short video clips such as TV ads. Our pipeline also includes novel stages for temporal smoothing of the reenacted face, actor-specific super resolution to retain fine facial details, and a second pass through the identity transfer network to preserve actor identity. Localization of lip sync is achieved by restricting changes in the original video frame to just the actor's mouth region. The results are convincing, and a user survey confirms their quality. Relevant quantitative metrics are included.
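The abstract's final localization step, restricting changes in the original frame to the actor's mouth region, amounts to compositing the reenacted face back into the source frame under a soft mouth mask. The thesis text here gives no code, so the following is only a minimal illustrative sketch of that idea; the function name, array shapes, and toy mask are all assumptions, not the author's implementation.

```python
import numpy as np

def composite_mouth_region(original, reenacted, mouth_mask):
    """Blend the reenacted frame into the original frame, restricting
    changes to the mouth region via a soft (0..1) mask.

    original, reenacted: H x W x 3 uint8 frames (hypothetical shapes).
    mouth_mask: H x W float mask, 1.0 inside the mouth region.
    """
    mask = mouth_mask[..., None].astype(np.float32)  # broadcast over channels
    blended = mask * reenacted + (1.0 - mask) * original
    return blended.astype(original.dtype)

# Toy example: the mask covers only the lower half of a 4x4 frame,
# so the upper half of the output stays identical to the original.
original = np.zeros((4, 4, 3), dtype=np.uint8)
reenacted = np.full((4, 4, 3), 255, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.float32)
mask[2:, :] = 1.0
out = composite_mouth_region(original, reenacted, mask)
```

In a real pipeline the mask would come from facial-landmark detection with feathered edges, but the compositing arithmetic is the same.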

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Patel, Dhyey Devendrakumar
Institution: Concordia University
Degree Name: M. Comp. Sc.
Program: Computer Science
Date: August 2022
Thesis Supervisor(s): Popa, Tiberiu and Mudur, Sudhir
ID Code: 991098
Deposited By: Dhyey Devendrakumar Patel
Deposited On: 27 Oct 2022 14:38
Last Modified: 06 Mar 2023 16:31
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.

