VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
Authors: Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, Baining Guo
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments including evaluation on a set of new metrics, we show that our method significantly outperforms previous methods along various dimensions comprehensively. |
| Researcher Affiliation | Industry | All nine authors — Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, and Baining Guo — are affiliated with Microsoft Research Asia (email addresses redacted). |
| Pseudocode | No | The paper describes the methodology but does not contain any structured pseudocode or algorithm blocks labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | Based on the RAI considerations, we will not release our code or data in case of potential misuse, as discussed in Section A. |
| Open Datasets | Yes | For face latent space learning, we use the public VoxCeleb2 dataset from [14] which contains talking face videos from about 6K subjects. |
| Dataset Splits | No | The paper states that the training data comprises approximately 500K clips, each lasting between 2 to 10 seconds, but it does not describe any train/validation/test split. |
| Hardware Specification | Yes | Our face latent model takes around 7 days of training on a workstation with 4 NVIDIA RTX A6000 GPUs, and the diffusion transformer takes around 3 days. ... evaluated on a desktop PC with a single NVIDIA RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions using a pretrained feature extractor Wav2Vec2 [3] and SyncNet [15] for evaluation, but does not specify software dependencies with version numbers for its implementation. |
| Experiment Setup | Yes | For motion latent generation, we use an 8-layer transformer encoder with an embedding dim 512 and head number 8 as our diffusion network. The model is trained on VoxCeleb2 [14] and another high-resolution talk video dataset collected by us, which contains about 3.5K subjects. In our default setup, the model uses a forward-facing main gaze condition, an average head distance of all training videos, and an empty emotion offset condition. The CFG parameters are set to λA = 0.5 and λg = 1.0, and 50 sampling steps are used. |
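The reported CFG parameters (λA = 0.5 for audio, λg = 1.0 for gaze) suggest classifier-free guidance applied over two conditions. The sketch below shows one common way to compose multi-condition CFG from three denoiser passes (unconditional, audio-only, audio + gaze); the function name `cfg_combine` and the incremental-delta composition are assumptions for illustration, not the paper's exact rule, which is not released.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_audio, eps_audio_gaze, lam_a=0.5, lam_g=1.0):
    """Hedged sketch of multi-condition classifier-free guidance.

    Combines three noise predictions from the same diffusion network:
      eps_uncond     -- prediction with all conditions dropped
      eps_audio      -- prediction conditioned on audio only
      eps_audio_gaze -- prediction conditioned on audio and gaze
    Starts from the unconditional prediction and adds a scaled delta for
    each incrementally added condition. lam_a and lam_g mirror the paper's
    reported lambda_A = 0.5 and lambda_g = 1.0; the actual composition
    rule used by VASA-1 may differ.
    """
    return (eps_uncond
            + lam_a * (eps_audio - eps_uncond)
            + lam_g * (eps_audio_gaze - eps_audio))

# Toy usage with dummy noise predictions for a 512-dim motion latent:
e0 = np.zeros(512)          # unconditional pass
ea = np.ones(512)           # audio-conditioned pass
eag = 2.0 * np.ones(512)    # audio + gaze conditioned pass
guided = cfg_combine(e0, ea, eag)   # each entry: 0 + 0.5*1 + 1.0*1 = 1.5
```

In a sampler, `cfg_combine` would be called once per step, with the guided prediction fed to the denoising update for each of the 50 sampling steps the paper reports.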