RealPortrait: Realistic Portrait Animation with Diffusion Transformers

Authors: Zejun Yang, Huawei Wei, Zhisheng Wang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations demonstrate that RealPortrait excels in generating portrait animations with highly realistic quality and exceptional temporal coherence. The paper includes sections such as "Experiments", "Implementation Details", "Datasets", "Comparisons and Evaluations", and "Ablation Studies", and presents quantitative comparisons in tables (Table 1, Table 2, Table 3) and qualitative results in figures (Figure 3, Figure 4, Figure 5), indicating an experimental approach with data analysis.
Researcher Affiliation | Industry | All authors are affiliated with Tencent, which is a company.
Pseudocode | No | The paper describes methods and processes using descriptive text and mathematical equations, but does not include any clearly labeled pseudocode or algorithm blocks. For example, the "Method" section explains the framework without using a structured pseudocode format.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide any links to a code repository or mention code in supplementary materials for the described methodology.
Open Datasets | Yes | Our data comprises both image and video datasets. The former includes several publicly available face datasets, such as VGGFace2 (Cao et al. 2018), CelebA (Liu et al. 2015), and FFHQ (Karras, Laine, and Aila 2019). The video datasets include several face video datasets, such as CelebV-HQ (Zhu et al. 2022), TalkingHead-1KH (Wang, Mallya, and Liu 2021), HDTF (Zhang et al. 2021), MEAD (Wang et al. 2020), and VFHQ (Xie et al. 2022).
Dataset Splits | Yes | For evaluation, we split 100 clips from VFHQ and CelebV-HQ to form a test set. In the second stage, we transition to a multi-frame training mode, with a primary focus on temporal consistency. During this stage, we exclusively utilize video data, randomly selecting a continuous sequence of L frames from the video as a training clip, where L is set to 16 in our experiments.
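The clip-sampling procedure quoted above (a continuous sequence of L = 16 frames drawn at random from each video) can be sketched as follows; the function name and interface are illustrative, not from the paper:

```python
import random


def sample_clip(num_frames: int, clip_len: int = 16, seed=None) -> list[int]:
    """Return the frame indices of a random continuous clip of length clip_len.

    num_frames: total number of frames in the source video.
    clip_len:   L in the paper's notation (16 in their experiments).
    """
    if num_frames < clip_len:
        raise ValueError("video is shorter than the requested clip length")
    rng = random.Random(seed)
    # A valid clip can start anywhere from frame 0 to num_frames - clip_len.
    start = rng.randint(0, num_frames - clip_len)
    return list(range(start, start + clip_len))


clip = sample_clip(num_frames=100, seed=0)
print(len(clip))  # -> 16
```

Because the clip is contiguous, temporal order within each training sample is preserved, which matches the second stage's focus on temporal consistency.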
Hardware Specification | Yes | We train the model in two stages using 4 A100 GPUs.
Software Dependencies | Yes | We use the VAE from SD1.5 (Rombach et al. 2022) to encode images into the latent space, where the latent features have a resolution of 64x64. We utilize the STDiT2 structure from Open-Sora as our architecture.
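The stated 64x64 latent resolution is consistent with the SD1.5 VAE, which downsamples spatially by a factor of 8 and produces 4 latent channels. A minimal sketch of that arithmetic (the helper function is illustrative, not from the paper):

```python
def latent_shape(height: int, width: int,
                 channels: int = 4, downsample: int = 8) -> tuple[int, int, int]:
    """Latent tensor shape for the SD1.5 VAE: 4 channels, 8x spatial downsampling."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (channels, height // downsample, width // downsample)


# The paper's 512x512 crops yield exactly the 64x64 latents it reports.
print(latent_shape(512, 512))  # -> (4, 64, 64)
```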
Experiment Setup | Yes | The backbone consists of 28 blocks, while the ControlNet comprises 13 blocks. All samples are cropped to a resolution of 512x512, and we do not perform alignment on the faces. In the first stage, the batch size is 32, and in the second stage, the batch size is 4. The first stage is trained for 20K steps, and the second stage is trained for 30K steps, with the total training time approaching one week. During the testing phase, our denoising process consists of 50 steps. We randomly select a continuous sequence of L frames from the video as a training clip, where L is set to 16 in our experiments.
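The two-stage schedule above can be summarized in a small config sketch. The hyperparameter values are taken from the paper; the dataclass, its field names, and the assumption that stage one trains on single frames (it precedes the multi-frame stage) are illustrative:

```python
from dataclasses import dataclass


@dataclass
class StageConfig:
    batch_size: int
    train_steps: int
    clip_frames: int  # frames per training sample


# Stage 1: image training (single-frame is an assumption); Stage 2: 16-frame video clips.
stage1 = StageConfig(batch_size=32, train_steps=20_000, clip_frames=1)
stage2 = StageConfig(batch_size=4, train_steps=30_000, clip_frames=16)

# Samples seen per stage (steps * batch size), assuming no gradient accumulation:
samples_stage1 = stage1.train_steps * stage1.batch_size  # 640,000 images
samples_stage2 = stage2.train_steps * stage2.batch_size  # 120,000 clips
```

Note the deliberate trade-off: the video stage uses a much smaller batch (4 vs. 32) because each sample is a 16-frame clip rather than a single image.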