Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

Authors: Jiahao Cui, Hui Li, Yao Yao, Hao Zhu, Hanlin Shang, Kaihui Cheng, Hang Zhou, Siyu Zhu, Jingdong Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We have conducted extensive experiments to evaluate our method on publicly available datasets, including HDTF, CelebV, and our introduced Wild dataset. The experimental results demonstrate that our approach achieves state-of-the-art performance in long-duration portrait video animation, successfully generating rich and controllable content at 4K resolution for durations extending up to tens of minutes.
Researcher Affiliation | Collaboration | Jiahao Cui¹, Hui Li¹, Yao Yao³, Hao Zhu³, Hanlin Shang¹, Kaihui Cheng¹, Hang Zhou², Siyu Zhu¹, Jingdong Wang²; ¹Fudan University, ²Baidu Inc., ³Nanjing University
Pseudocode | No | The paper describes the method using textual descriptions and figures (Figures 3 and 4) but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper neither states that source code for the described methodology is publicly available nor provides a link to a code repository.
Open Datasets | Yes | We have conducted extensive experiments to evaluate our method on publicly available datasets, including HDTF, CelebV, and our introduced Wild dataset.
Dataset Splits | No | The paper mentions using approximately 160 hours of video data for training but does not report train/validation/test splits (as percentages or counts) for the datasets used in the experiments.
Hardware Specification | Yes | All experiments were conducted on a GPU server equipped with 8 NVIDIA A100 GPUs.
Software Dependencies | Yes | Regarding textual control, we employed the vision-language model MiniCPM (Hu et al., 2024) to generate textual prompts. These prompts were refined using Llama 3.1.
Experiment Setup | Yes | The training process was executed in two stages: the first stage comprised 30,000 steps with a batch size of 4, targeting a video resolution of 512×512 pixels. The second stage involved 28,000 steps with a batch size of 4, initializing the motion module with weights from AnimateDiff. Approximately 160 hours of video data were utilized across both stages, with a learning rate of 1e-5. For the super-resolution component, training for temporal alignment was extended to 550,000 steps, leveraging initial weights from CodeFormer and a learning rate of 1e-4, using the VFHQ dataset as the super-resolution training data.
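The reported hyperparameters can be collected into a small configuration sketch. This is purely illustrative: the field names and the helper function are hypothetical and do not come from the authors' code; only the numeric values reflect the paper's description.

```python
# Hypothetical summary of the training schedule reported in the paper.
# Field names are illustrative; values are taken from the paper's text.

TRAINING_STAGES = [
    {
        "name": "stage1_spatial",       # hypothetical label
        "steps": 30_000,
        "batch_size": 4,
        "resolution": (512, 512),
        "learning_rate": 1e-5,
    },
    {
        "name": "stage2_motion",        # hypothetical label
        "steps": 28_000,
        "batch_size": 4,
        "init_weights": "AnimateDiff motion module",
        "learning_rate": 1e-5,
    },
]

SUPER_RESOLUTION = {
    "steps": 550_000,
    "init_weights": "CodeFormer",
    "learning_rate": 1e-4,
    "dataset": "VFHQ",
}

def total_generation_steps(stages):
    """Total optimizer steps across the two video-generation stages."""
    return sum(s["steps"] for s in stages)
```

Under this reading, the two generation stages account for 58,000 optimizer steps over roughly 160 hours of video, with the 550,000-step super-resolution training handled separately on VFHQ.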