Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation
Authors: Jiahao Cui, Hui Li, Yao Yao, Hao Zhu, Hanlin Shang, Kaihui Cheng, Hang Zhou, Siyu Zhu, Jingdong Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have conducted extensive experiments to evaluate our method on publicly available datasets, including HDTF, CelebV, and our introduced Wild dataset. The experimental results demonstrate that our approach achieves state-of-the-art performance in long-duration portrait video animation, successfully generating rich and controllable content at 4K resolution for durations extending up to tens of minutes. |
| Researcher Affiliation | Collaboration | Jiahao Cui1, Hui Li1, Yao Yao3, Hao Zhu3, Hanlin Shang1, Kaihui Cheng1, Hang Zhou2, Siyu Zhu1, Jingdong Wang2 — 1Fudan University, 2Baidu Inc., 3Nanjing University |
| Pseudocode | No | The paper describes the method using textual descriptions and figures (Figure 3 and 4) but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that source code for the described methodology is publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We have conducted extensive experiments to evaluate our method on publicly available datasets, including HDTF, CelebV, and our introduced Wild dataset. |
| Dataset Splits | No | The paper mentions using approximately 160 hours of video data for training but does not provide specific train/test/validation splits (percentages or counts) for the datasets used in experiments. |
| Hardware Specification | Yes | All experiments were conducted on a GPU server equipped with 8 NVIDIA A100 GPUs. |
| Software Dependencies | Yes | Regarding textual control, we employed the vision-language model MiniCPM Hu et al. (2024) to generate textual prompts. These prompts were refined using Llama 3.1. |
| Experiment Setup | Yes | The training process was executed in two stages: the first stage comprised 30,000 steps with a step size of 4, targeting a video resolution of 512×512 pixels. The second stage involved 28,000 steps with a batch size of 4, initializing the motion module with weights from AnimateDiff. Approximately 160 hours of video data were utilized across both stages, with a learning rate set at 1e-5. For the super-resolution component, training for temporal alignment was extended to 550,000 steps, leveraging initial weights from CodeFormer and a learning rate of 1e-4, using the VFHQ dataset as the super-resolution training data. |
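For reproduction purposes, the hyperparameters reported in the Experiment Setup row can be collected into a single configuration. The sketch below is illustrative only: the dictionary structure and key names are assumptions, not taken from the authors' codebase, and only the numeric values come from the paper.

```python
# Hedged sketch of the reported Hallo2 training hyperparameters.
# Key names and structure are assumed for illustration; values are
# those quoted from the paper's experiment setup.
TRAINING_CONFIG = {
    "stage1": {
        "steps": 30_000,
        "batch_size": 4,            # reported as "step size of 4" in the text
        "resolution": (512, 512),
    },
    "stage2": {
        "steps": 28_000,
        "batch_size": 4,
        "motion_module_init": "AnimateDiff",
    },
    "shared": {
        "train_video_hours": 160,   # approximate, across both stages
        "learning_rate": 1e-5,
    },
    "super_resolution": {
        "steps": 550_000,
        "init_weights": "CodeFormer",
        "learning_rate": 1e-4,
        "dataset": "VFHQ",
    },
}

def total_main_steps(cfg: dict) -> int:
    """Sum the optimization steps of the two main training stages."""
    return cfg["stage1"]["steps"] + cfg["stage2"]["steps"]

print(total_main_steps(TRAINING_CONFIG))  # 58000
```

A config like this makes it easy to check that a reimplementation matches the reported budget (58,000 main-stage steps plus 550,000 super-resolution steps) before launching a run.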