Dates
Friday, May 20, 2022 - 10:30am to 12:00pm
Location
NCS 220
Event Description

Abstract:

Synthesizing high-fidelity talking head videos of an arbitrary identity to match a target speech segment is a challenging problem. Earlier approaches to lip-synced video synthesis proposed detailed pipelines of hand-engineered steps. With the rise of neural networks, encoder-decoder architectures that can be trained end-to-end became popular. More recently, GAN-based methods have produced photorealistic results by training a model on a large number of videos, leading to a universal generator that can be applied to any speaker. Despite their success, these methods are limited to low-resolution inputs. To address this limitation, dynamic neural radiance fields (NeRFs) conditioned on audio input have been explored in the last few months. However, NeRF-based approaches still fall short in terms of lip synchronization. In this work, we propose a method that bridges the gap between the accurate lip synchronization of GAN-based approaches and the high-resolution quality of NeRF-based ones. Leveraging the lip sync accuracy of a model pre-trained on low-resolution videos, we extract the expression parameters of a morphable model fitted to its results. Then, a NeRF conditioned on audio, expression, and learned latent codes produces the final talking head videos through volumetric rendering. Quantitative and qualitative evaluations on dubbed movies demonstrate the potential of our method.
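To give a concrete picture of the conditioning described in the abstract, the sketch below is a minimal, hypothetical PyTorch illustration of a NeRF MLP that takes a 3D sample point together with per-frame audio features, morphable-model expression parameters, and a learned latent code, and outputs density and color that are composited along a ray by standard volumetric rendering. The class name `ConditionedNeRF`, the feature dimensions, and the rendering loop are assumptions for illustration, not the presenter's actual implementation.

```python
# Minimal sketch of an audio/expression-conditioned NeRF (assumed names and sizes).
import torch
import torch.nn as nn


def positional_encoding(x, num_freqs=10):
    """Standard NeRF-style sinusoidal encoding of 3D points."""
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device) * torch.pi
    angles = x[..., None] * freqs                 # (..., 3, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(start_dim=-2)              # (..., 3 * 2 * num_freqs)


class ConditionedNeRF(nn.Module):
    """MLP mapping (encoded point, audio, expression, latent code) -> (density, RGB)."""

    def __init__(self, audio_dim=64, expr_dim=76, latent_dim=32, hidden=256, num_freqs=10):
        super().__init__()
        in_dim = 3 * 2 * num_freqs + audio_dim + expr_dim + latent_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                 # density (sigma) + RGB
        )

    def forward(self, pts, audio, expr, latent):
        # pts: (N, 3); audio/expr/latent: per-frame vectors shared by all samples
        cond = torch.cat([audio, expr, latent], dim=-1).expand(pts.shape[0], -1)
        h = self.mlp(torch.cat([positional_encoding(pts), cond], dim=-1))
        sigma = torch.relu(h[..., :1])            # non-negative density
        rgb = torch.sigmoid(h[..., 1:])           # colors in [0, 1]
        return sigma, rgb


def render_ray(model, origin, direction, audio, expr, latent,
               near=0.1, far=2.0, n_samples=64):
    """Composite a pixel color along one ray with the volumetric rendering equation."""
    t = torch.linspace(near, far, n_samples)
    pts = origin + t[:, None] * direction         # sample points along the ray
    sigma, rgb = model(pts, audio, expr, latent)
    delta = t[1] - t[0]
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * delta)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                       # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)    # final pixel color
```

In practice, the audio features, the expression parameters extracted from the fitted morphable model, and the learned latent code would be supplied per frame and rays would be rendered in batches; the sketch only shows the shape of the conditioning and the rendering step.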

Event Title
Ph.D. Research Proficiency Presentation: Aggelina Chatziagapi, 'Audio-Driven Talking Head Video Synthesis'