EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions
Authors: Zhiyuan Chen, Jiajiong Cao, Zhiquan Chen, Yuming Li, Chenguang Ma
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | EchoMimic has been comprehensively compared with alternative algorithms across various public datasets and the authors' collected dataset, showing superior performance in both quantitative and qualitative evaluations. |
| Researcher Affiliation | Industry | Terminal Technology Department, Alipay, Ant Group, Hangzhou, China |
| Pseudocode | No | The paper describes methods and a model architecture, but it does not contain any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | The code and models are available on the project page. Project Page https://antgroup.github.io/ai/echomimic |
| Open Datasets | Yes | EchoMimic is extensively compared with alternative algorithms across diverse public datasets and the authors' collected dataset, demonstrating superior performance in both quantitative and qualitative evaluations. Datasets: the authors collected approximately 540 hours (about 130,000 15-second video clips) of talking-head videos, augmented with the HDTF and CelebV-HQ datasets. |
| Dataset Splits | Yes | Identity data was split 90:10, with 90% used for training and the remainder held out. |
| Hardware Specification | Yes | Implementation Details. Experiments involved training and inference phases on a high-performance computing setup with 8 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions specific models and architectures like 'Whisper-Tiny model (Radford et al. 2023)', 'Stable Diffusion (SD)', 'SDv1.5 architecture', 'CLIP (Radford et al. 2021) ViTL/14 text encoder', and 'pre-trained Animatediff weights'. However, it does not provide specific version numbers for ancillary software components (e.g., programming languages, libraries, or frameworks like Python, PyTorch, or CUDA versions). |
| Experiment Setup | Yes | Training comprised two segments of 30,000 steps each, using a batch size of 4 with 512×512-pixel video data. ... A constant learning rate of 1e-5 was used. |
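
The identity-level 90:10 split reported above can be sketched as follows. This is a minimal illustration, assuming the split is done over unique identity IDs so that no identity appears in both sets; the function name, the `seed`, and the splitting procedure are hypothetical, since the paper only states the ratio.

```python
import random

def split_by_identity(identity_ids, train_frac=0.9, seed=0):
    """Illustrative 90:10 identity-level split (procedure assumed, not
    taken from the paper): deduplicate IDs, shuffle, then cut at 90%."""
    ids = sorted(set(identity_ids))
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(ids)
    cut = int(len(ids) * train_frac)
    return ids[:cut], ids[cut:]

# Example with 100 synthetic identities:
train_ids, test_ids = split_by_identity([f"id_{i}" for i in range(100)])
print(len(train_ids), len(test_ids))  # 90 10
```

Splitting on identities rather than individual clips avoids leaking the same speaker's appearance between training and evaluation.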
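
The training hyperparameters quoted in the table can be collected into a single configuration sketch. The dictionary below is illustrative only: the field names are invented for this summary, and only the values (two 30,000-step segments, batch size 4, 512×512 resolution, constant learning rate 1e-5, 8 A100 GPUs) come from the paper.

```python
# Hypothetical config assembled from the reported figures; field names
# are illustrative and do not come from the released code.
train_config = {
    "stages": 2,                 # two training segments
    "steps_per_stage": 30_000,   # 30,000 steps each
    "batch_size": 4,
    "resolution": (512, 512),    # pixels
    "learning_rate": 1e-5,       # constant, no schedule reported
    "num_gpus": 8,               # NVIDIA A100
}

total_steps = train_config["stages"] * train_config["steps_per_stage"]
print(total_steps)  # 60000
```

Such a consolidated view makes it easy to check whether a reproduction run matches the reported setup.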