Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency

Authors: Jianwen Jiang, Chao Liang, Jiaqi Yang, Gaojie Lin, Tianyun Zhong, Yanbo Zheng

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments show that Loopy outperforms recent audio-driven portrait diffusion models, delivering more lifelike and high-quality results across various scenarios. Video samples are available at this URL." (Evidence: Sections 3.5 Experiments, 3.5.1 Results and Analysis, 3.5.2 Ablation Studies.)
Researcher Affiliation | Collaboration | Jianwen Jiang1, Chao Liang1, Jiaqi Yang1, Gaojie Lin1, Tianyun Zhong2, Yanbo Zheng1 (1ByteDance, 2Zhejiang University)
Pseudocode | No | The paper describes its methodology in natural language and mathematical equations, but does not include any clearly labeled pseudocode or algorithm blocks. Figures 2, 3, and 4 are architectural diagrams, not pseudocode.
Open Source Code | No | The paper states: "Video samples are available at this URL.", "Video samples are provided in the supplementary materials.", and "We provided videos where Loopy performs poorly, along with an analysis, in the videos provided on our project homepage." These statements refer to video samples and a project homepage, not to open-sourcing the code for the described methodology.
Open Datasets | Yes | "For test sets, we randomly sampled 100 videos from CelebV-HQ (Zhu et al., 2022) (a public high-quality celebrity video dataset with mixed scenes), RAVDESS (Kaggle) (a public high-definition indoor talking-scene dataset with rich emotions) and HDTF (Zhang et al., 2021)." The training data includes the public talking-head dataset HDTF (Zhang et al., 2021) and other sources from which talking-head data can be obtained through post-processing, including data sources listed in OpenVid (Nan et al., 2024) and VFHQ (Xie et al., 2022), and online video platforms such as Pexels.
Dataset Splits | No | "For training data, we collected and filtered talking head videos from multiple sources... This resulted in 174 hours of training data. The dataset is detailed in Appendix A." The test sets consist of 100 videos randomly sampled from CelebV-HQ, RAVDESS, and HDTF, and videos selected for testing are excluded from the training split. The paper thus specifies the composition and size of its test sets and that test data was excluded from training, but it does not provide specific training/validation/test splits (e.g., percentages or exact counts for the main 174-hour training dataset) that would be needed to fully reproduce the data partitioning.
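The stated exclusion of test videos from training can be expressed as a simple filtering step. The sketch below is a hypothetical illustration (the clip-ID representation and function name are assumptions, not the paper's actual pipeline):

```python
def build_train_list(candidate_clips, test_clips):
    """Drop any candidate clip whose ID also appears in a test set,
    so held-out evaluation videos never leak into the training pool."""
    held_out = set(test_clips)
    return [clip for clip in candidate_clips if clip not in held_out]


# Example: three candidate clips, one of which was sampled for testing.
train = build_train_list(["hdtf_001", "hdtf_002", "pexels_007"], ["hdtf_002"])
```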
Hardware Specification | Yes | "We trained our model using 24 Nvidia A100 GPUs with a batch size of 24." Inference timing was measured on an unloaded A100: "Loopy takes 18 seconds to complete 25 steps of denoising to generate 12 frames on an A100 GPU."
Software Dependencies | No | The paper uses wav2vec (Baevski et al., 2020; Schneider et al., 2019) for audio feature extraction, DWPose (Yang et al., 2023) to detect facial keypoints, Q-Align (Wu et al., 2023) to assess the visual quality of generated videos, the smoothness metric from the widely used video generation evaluation tool VBench (Huang et al., 2024), and the widely used Sync-C (confidence) and Sync-D (distance) metrics (proposed in SyncNet (Chung & Zisserman, 2017)). However, it does not provide version numbers for any of these tools, which are required for reproducible software dependencies.
Experiment Setup | Yes | "We trained our model using 24 Nvidia A100 GPUs with a batch size of 24, using the AdamW (Loshchilov & Hutter, 2017) optimizer with a learning rate of 1e-5 to train the model for two stages, each lasting 4 days. The generated video length was set to 12 frames, and the motion frame was set to 124 frames... The training videos were uniformly processed at 25 FPS and cropped to 512×512 portrait videos." During inference, classifier-free guidance (Ho & Salimans, 2022) is performed using multiple conditions; the audio ratio is set to 5 and the reference ratio to 3, and DDIM with 25 denoising steps is used.
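The paper reports guidance scales (audio ratio 5, reference ratio 3) but not the exact combination formula. A common way to combine two guidance scales in multi-condition classifier-free guidance is nested interpolation between the unconditional, reference-only, and fully conditioned noise predictions; the sketch below assumes this form and the function/argument names are illustrative, not taken from the paper:

```python
import numpy as np

def multi_cond_cfg(eps_full, eps_ref, eps_uncond, audio_scale=5.0, ref_scale=3.0):
    """Combine three noise predictions with two guidance scales.

    eps_full   -- prediction conditioned on both audio and reference image
    eps_ref    -- prediction conditioned on the reference image only
    eps_uncond -- fully unconditional prediction
    """
    return (eps_uncond
            + ref_scale * (eps_ref - eps_uncond)     # reference guidance (ratio 3)
            + audio_scale * (eps_full - eps_ref))    # audio guidance (ratio 5)


# Example with dummy per-pixel predictions of the same shape.
eps = multi_cond_cfg(np.ones(4), np.zeros(4), np.zeros(4))
```

The guided prediction would then be passed to a DDIM step at each of the 25 denoising iterations.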