Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

Authors: Jiahao Cui, Hui Li, Yao Yao, Hao Zhu, Hanlin Shang, Kaihui Cheng, Hang Zhou, Siyu Zhu, Jingdong Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We have conducted extensive experiments to evaluate our method on publicly available datasets, including HDTF, CelebV, and our introduced Wild dataset. The experimental results demonstrate that our approach achieves state-of-the-art performance in long-duration portrait video animation, successfully generating rich and controllable content at 4K resolution for durations extending up to tens of minutes.
Researcher Affiliation | Collaboration | Jiahao Cui¹, Hui Li¹, Yao Yao³, Hao Zhu³, Hanlin Shang¹, Kaihui Cheng¹, Hang Zhou², Siyu Zhu¹, Jingdong Wang²; ¹Fudan University, ²Baidu Inc., ³Nanjing University
Pseudocode | No | The paper describes the method using textual descriptions and figures (Figures 3 and 4) but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper neither states that source code for the described methodology is publicly available nor provides a link to a code repository.
Open Datasets | Yes | We have conducted extensive experiments to evaluate our method on publicly available datasets, including HDTF, CelebV, and our introduced Wild dataset.
Dataset Splits | No | The paper mentions using approximately 160 hours of video data for training but does not report train/validation/test splits (as percentages or counts) for the datasets used in the experiments.
Hardware Specification | Yes | All experiments were conducted on a GPU server equipped with 8 NVIDIA A100 GPUs.
Software Dependencies | Yes | Regarding textual control, we employed the vision-language model MiniCPM (Hu et al., 2024) to generate textual prompts. These prompts were refined using Llama 3.1.
Experiment Setup | Yes | The training process was executed in two stages: the first stage comprised 30,000 steps with a batch size of 4, targeting a video resolution of 512×512 pixels. The second stage involved 28,000 steps with a batch size of 4, initializing the motion module with weights from AnimateDiff. Approximately 160 hours of video data were utilized across both stages, with a learning rate of 1e-5. For the super-resolution component, training for temporal alignment was extended to 550,000 steps, leveraging initial weights from CodeFormer and a learning rate of 1e-4, using the VFHQ dataset as the super-resolution training data.
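The reported hyperparameters can be collected into a small configuration sketch. This is purely illustrative: the field names and the helper function are hypothetical and do not come from the authors' code; only the numeric values reflect the paper's description.

```python
# Hypothetical summary of the training schedule reported in the paper.
# Field names are illustrative; values are taken from the paper's text.

TRAINING_STAGES = [
    {
        "name": "stage1_spatial",       # hypothetical label
        "steps": 30_000,
        "batch_size": 4,
        "resolution": (512, 512),
        "learning_rate": 1e-5,
    },
    {
        "name": "stage2_motion",        # hypothetical label
        "steps": 28_000,
        "batch_size": 4,
        "init_weights": "AnimateDiff motion module",
        "learning_rate": 1e-5,
    },
]

SUPER_RESOLUTION = {
    "steps": 550_000,
    "init_weights": "CodeFormer",
    "learning_rate": 1e-4,
    "dataset": "VFHQ",
}

def total_generation_steps(stages):
    """Total optimizer steps across the two video-generation stages."""
    return sum(s["steps"] for s in stages)
```

Under this reading, the two generation stages account for 58,000 optimizer steps over roughly 160 hours of video, with the 550,000-step super-resolution training handled separately on VFHQ.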