X-NeMo: Expressive Neural Motion Reenactment via Disentangled Latent Attention
Authors: Xiaochen Zhao, Hongyi Xu, Guoxian Song, You Xie, Chenxu Zhang, Xiu Li, Linjie Luo, Jinli Suo, Yebin Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that X-NeMo surpasses state-of-the-art baselines, producing highly expressive animations with superior identity resemblance. Our code and models will be available for research at our project page. We extensively evaluate our model across our challenging benchmarks and X-NeMo outperforms state-of-the-art portrait animation baselines both quantitatively and qualitatively. Additionally, our expressive latent motion descriptor serves as a unified identity-agnostic embedding, facilitating motion interpolation and video outpainting applications beyond portrait animation. We summarize our contributions as follows: a novel diffusion-based portrait animation pipeline, coupled with latent motion representation, achieving state-of-the-art performance in terms of motion accuracy and identity disentanglement. In our evaluation, we compare our method against state-of-the-art video-driven portrait animation baselines, including X-Portrait Xie et al. (2024), AniPortrait Wei et al. (2024), Follow-your-Emoji (FYE) Ma et al. (2024), and EchoMimic Chen et al. (2024b). |
| Researcher Affiliation | Collaboration | Xiaochen Zhao1,2, Hongyi Xu2, Guoxian Song2, You Xie2, Chenxu Zhang2, Xiu Li2, Linjie Luo2, Jinli Suo1, Yebin Liu1 — 1 Tsinghua University, 2 ByteDance Inc. |
| Pseudocode | No | The paper describes the methodology using prose, equations, and diagrams, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Our code and models will be available for research at our project page. |
| Open Datasets | Yes | We train our model on a combination of talking head datasets (HDTF Zhang et al. (2021), VFHQ Xie et al. (2022)) and facial expression dataset (NeRSemble Kirschstein et al. (2023)), uniformly processed at 25 fps and cropped to a 512 × 512 resolution. For evaluation, we compile a benchmark of 100 in-the-wild reference portraits DeviantArt (2024); Midjourney (2024); Pexels (2024)... Additionally we collect 100 test videos from DFEW Jiang et al. (2020) featuring emotionally expressive clips...trained with MEAD dataset Wang et al. (2020). |
| Dataset Splits | No | The paper describes how test videos are used for self-reenactment evaluation (first frame as reference, subsequent frames as driving/ground truth) and defines custom benchmarks (100 reference portraits, 100 test videos, 200 licensed videos). However, it does not provide explicit training/validation/test splits (e.g., percentages or counts) for the primary datasets used for model training (HDTF, VFHQ, NeRSemble). |
| Hardware Specification | Yes | The training is conducted on 8 Nvidia A100 GPUs using the AdamW optimizer Yao et al. (2021) with a learning rate of 1e-5. For generating a 1-second video at 25 frames per second, the process takes approximately 20 seconds and requires 24 GB of memory. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and a pretrained VGG-19 model, but does not provide specific version numbers for any software libraries or dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | The training is conducted on 8 Nvidia A100 GPUs using the AdamW optimizer Yao et al. (2021) with a learning rate of 1e-5. We use a batch size of 64 for appearance and motion control training, and a batch size of 16 for the temporal module using 24-frame video sequences. During inference, we use 25 DDIM steps Song et al. (2020a) with a classifier-free guidance (CFG) scale of 3.5. L_GAN = L_adv + λ_r·L_recon + λ_vgg·L_vgg + λ_vggf·L_vggf + λ_fm·L_fm, where λ_r = 1.0, λ_vgg = 3e-2, λ_vggf = 6e-3, and λ_fm = 10.0. |
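The weighted loss in the Experiment Setup row can be sketched as a simple combination of its terms. This is a minimal illustration using the weights reported in the paper (λ_r = 1.0, λ_vgg = 3e-2, λ_vggf = 6e-3, λ_fm = 10.0); the function and argument names are hypothetical, not taken from the authors' (unreleased) code, and the individual loss terms are assumed to be precomputed scalars.

```python
# Illustrative sketch of the paper's combined GAN training objective:
#   L_GAN = L_adv + λ_r·L_recon + λ_vgg·L_vgg + λ_vggf·L_vggf + λ_fm·L_fm
# Weights are the values quoted in the paper; names are our own.

LAMBDA_R = 1.0      # reconstruction loss weight
LAMBDA_VGG = 3e-2   # VGG perceptual loss weight
LAMBDA_VGGF = 6e-3  # VGG face-region perceptual loss weight
LAMBDA_FM = 10.0    # discriminator feature-matching loss weight

def gan_loss(l_adv: float, l_recon: float, l_vgg: float,
             l_vggf: float, l_fm: float) -> float:
    """Combine precomputed loss terms with the paper's reported weights."""
    return (l_adv
            + LAMBDA_R * l_recon
            + LAMBDA_VGG * l_vgg
            + LAMBDA_VGGF * l_vggf
            + LAMBDA_FM * l_fm)
```

With unit values for every term, the weights alone determine the total: 1 + 1.0 + 0.03 + 0.006 + 10.0 = 12.036.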