RealPortrait: Realistic Portrait Animation with Diffusion Transformers

Authors: Zejun Yang, Huawei Wei, Zhisheng Wang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations demonstrate that RealPortrait excels in generating portrait animations with highly realistic quality and exceptional temporal coherence. The paper includes sections such as "Experiments", "Implementation Details", "Datasets", "Comparisons and Evaluations", and "Ablation Studies", and presents quantitative comparisons in tables (Table 1, Table 2, Table 3) and qualitative results in figures (Figure 3, Figure 4, Figure 5), indicating an experimental approach with data analysis.
Researcher Affiliation | Industry | All authors are affiliated with Tencent, which is a company.
Pseudocode | No | The paper describes methods and processes using descriptive text and mathematical equations, but does not include any clearly labeled pseudocode or algorithm blocks. For example, the "Method" section explains the framework without using a structured pseudocode format.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide any links to a code repository or mention code in supplementary materials for the described methodology.
Open Datasets | Yes | Our data comprises both image and video datasets. The former includes several publicly available face datasets, such as VGGFace2 (Cao et al. 2018), CelebA (Liu et al. 2015), and FFHQ (Karras, Laine, and Aila 2019). The video datasets include several face video datasets, such as CelebV-HQ (Zhu et al. 2022), TalkingHead-1KH (Wang, Mallya, and Liu 2021), HDTF (Zhang et al. 2021), MEAD (Wang et al. 2020), and VFHQ (Xie et al. 2022).
Dataset Splits | Yes | For evaluation, we split 100 clips from VFHQ and CelebV-HQ to form a test set. In the second stage, we transition to a multi-frame training mode, with a primary focus on temporal consistency. During this stage, we exclusively utilize video data, randomly selecting a continuous sequence of L frames from the video as a training clip, where L is set to 16 in our experiments.
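The clip-sampling procedure quoted above (a continuous sequence of L = 16 frames drawn at random from each video) can be sketched as follows; the function name and interface are illustrative, not from the paper:

```python
import random


def sample_clip(num_frames: int, clip_len: int = 16, seed=None) -> list[int]:
    """Return the frame indices of a random continuous clip of length clip_len.

    num_frames: total number of frames in the source video.
    clip_len:   L in the paper's notation (16 in their experiments).
    """
    if num_frames < clip_len:
        raise ValueError("video is shorter than the requested clip length")
    rng = random.Random(seed)
    # A valid clip can start anywhere from frame 0 to num_frames - clip_len.
    start = rng.randint(0, num_frames - clip_len)
    return list(range(start, start + clip_len))


clip = sample_clip(num_frames=100, seed=0)
print(len(clip))  # -> 16
```

Because the clip is contiguous, temporal order within each training sample is preserved, which matches the second stage's focus on temporal consistency.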
Hardware Specification | Yes | We train the model in two stages using 4 A100 GPUs.
Software Dependencies | Yes | We use the VAE from SD1.5 (Rombach et al. 2022) to encode images into the latent space, where the latent features have a resolution of 64x64. We utilize the STDiT2 structure from Open-Sora as our architecture.
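The stated 64x64 latent resolution is consistent with the SD1.5 VAE, which downsamples spatially by a factor of 8 and produces 4 latent channels. A minimal sketch of that arithmetic (the helper function is illustrative, not from the paper):

```python
def latent_shape(height: int, width: int,
                 channels: int = 4, downsample: int = 8) -> tuple[int, int, int]:
    """Latent tensor shape for the SD1.5 VAE: 4 channels, 8x spatial downsampling."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (channels, height // downsample, width // downsample)


# The paper's 512x512 crops yield exactly the 64x64 latents it reports.
print(latent_shape(512, 512))  # -> (4, 64, 64)
```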
Experiment Setup | Yes | The backbone consists of 28 blocks, while the ControlNet comprises 13 blocks. All samples are cropped to a resolution of 512x512, and we do not perform alignment on the faces. In the first stage, the batch size is 32, and in the second stage, the batch size is 4. The first stage is trained for 20K steps, and the second stage is trained for 30K steps, with the total training time approaching one week. During the testing phase, our denoising process consists of 50 steps. We randomly select a continuous sequence of L frames from the video as a training clip, where L is set to 16 in our experiments.
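The two-stage schedule above can be summarized in a small config sketch. The hyperparameter values are taken from the paper; the dataclass, its field names, and the assumption that stage one trains on single frames (it precedes the multi-frame stage) are illustrative:

```python
from dataclasses import dataclass


@dataclass
class StageConfig:
    batch_size: int
    train_steps: int
    clip_frames: int  # frames per training sample


# Stage 1: image training (single-frame is an assumption); Stage 2: 16-frame video clips.
stage1 = StageConfig(batch_size=32, train_steps=20_000, clip_frames=1)
stage2 = StageConfig(batch_size=4, train_steps=30_000, clip_frames=16)

# Samples seen per stage (steps * batch size), assuming no gradient accumulation:
samples_stage1 = stage1.train_steps * stage1.batch_size  # 640,000 images
samples_stage2 = stage2.train_steps * stage2.batch_size  # 120,000 clips
```

Note the deliberate trade-off: the video stage uses a much smaller batch (4 vs. 32) because each sample is a 16-frame clip rather than a single image.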