X-NeMo: Expressive Neural Motion Reenactment via Disentangled Latent Attention
Authors: Xiaochen Zhao, Hongyi Xu, Guoxian Song, You Xie, Chenxu Zhang, Xiu Li, Linjie Luo, Jinli Suo, Yebin Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that X-NeMo surpasses state-of-the-art baselines, producing highly expressive animations with superior identity resemblance. Our code and models will be available for research at our project page. We extensively evaluate our model across our challenging benchmarks and X-NeMo outperforms state-of-the-art portrait animation baselines both quantitatively and qualitatively. Additionally, our expressive latent motion descriptor serves as a unified identity-agnostic embedding, facilitating motion interpolation and video outpainting applications beyond portrait animation. We summarize our contributions as follows: a novel diffusion-based portrait animation pipeline, coupled with latent motion representation, achieving state-of-the-art performance in terms of motion accuracy and identity disentanglement. In our evaluation, we compare our method against state-of-the-art video-driven portrait animation baselines, including X-Portrait Xie et al. (2024), AniPortrait Wei et al. (2024), Follow-your-Emoji (FYE) Ma et al. (2024), and EchoMimic Chen et al. (2024b). |
| Researcher Affiliation | Collaboration | Xiaochen Zhao1,2, Hongyi Xu2, Guoxian Song2, You Xie2, Chenxu Zhang2, Xiu Li2, Linjie Luo2, Jinli Suo1, Yebin Liu1 — 1 Tsinghua University, 2 ByteDance Inc. |
| Pseudocode | No | The paper describes the methodology using prose, equations, and diagrams, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Our code and models will be available for research at our project page. |
| Open Datasets | Yes | We train our model on a combination of talking head datasets (HDTF Zhang et al. (2021), VFHQ Xie et al. (2022)) and facial expression dataset (NeRSemble Kirschstein et al. (2023)), uniformly processed at 25 fps and cropped to a 512 × 512 resolution. For evaluation, we compile a benchmark of 100 in-the-wild reference portraits DeviantArt (2024); Midjourney (2024); Pexels (2024)... Additionally we collect 100 test videos from DFEW Jiang et al. (2020) featuring emotionally expressive clips...trained with MEAD dataset Wang et al. (2020). |
| Dataset Splits | No | The paper describes how test videos are used for self-reenactment evaluation (first frame as reference, subsequent frames as driving/ground truth) and defines custom benchmarks (100 reference portraits, 100 test videos, 200 licensed videos). However, it does not provide explicit training/validation/test splits (e.g., percentages or counts) for the primary datasets used for model training (HDTF, VFHQ, NeRSemble). |
| Hardware Specification | Yes | The training is conducted on 8 Nvidia A100 GPUs using the AdamW optimizer Yao et al. (2021) with a learning rate of 1e-5. For generating a 1-second video at 25 frames per second, the process takes approximately 20 seconds and requires 24 GB of memory. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and a pretrained VGG-19 model, but does not provide specific version numbers for any software libraries or dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | The training is conducted on 8 Nvidia A100 GPUs using the AdamW optimizer Yao et al. (2021) with a learning rate of 1e-5. We use a batch size of 64 for appearance and motion control training, and a batch size of 16 for the temporal module using 24-frame video sequences. During inference, we use 25 DDIM steps Song et al. (2020a) with a classifier-free guidance (CFG) scale of 3.5. L_GAN = L_adv + λ_r·L_recon + λ_vgg·L_vgg + λ_vggf·L_vggf + λ_fm·L_fm, where λ_r = 1.0, λ_vgg = 3e-2, λ_vggf = 6e-3, and λ_fm = 10.0. |
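The weighted loss in the Experiment Setup row can be sketched as a simple combination of its terms. This is a minimal illustration using the weights reported in the paper (λ_r = 1.0, λ_vgg = 3e-2, λ_vggf = 6e-3, λ_fm = 10.0); the function and argument names are hypothetical, not taken from the authors' (unreleased) code, and the individual loss terms are assumed to be precomputed scalars.

```python
# Illustrative sketch of the paper's combined GAN training objective:
#   L_GAN = L_adv + λ_r·L_recon + λ_vgg·L_vgg + λ_vggf·L_vggf + λ_fm·L_fm
# Weights are the values quoted in the paper; names are our own.

LAMBDA_R = 1.0      # reconstruction loss weight
LAMBDA_VGG = 3e-2   # VGG perceptual loss weight
LAMBDA_VGGF = 6e-3  # VGG face-region perceptual loss weight
LAMBDA_FM = 10.0    # discriminator feature-matching loss weight

def gan_loss(l_adv: float, l_recon: float, l_vgg: float,
             l_vggf: float, l_fm: float) -> float:
    """Combine precomputed loss terms with the paper's reported weights."""
    return (l_adv
            + LAMBDA_R * l_recon
            + LAMBDA_VGG * l_vgg
            + LAMBDA_VGGF * l_vggf
            + LAMBDA_FM * l_fm)
```

With unit values for every term, the weights alone determine the total: 1 + 1.0 + 0.03 + 0.006 + 10.0 = 12.036.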