Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation
Authors: Yuchi Wang, Junliang Guo, Jianhong Bai, Runyi Yu, Tianyu He, Xu Tan, Xu Sun, Jiang Bian
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that InstructAvatar produces results that align well with both conditions, and outperforms existing methods in fine-grained emotion control, lip-sync quality, and naturalness. For experiments, we propose several tailored evaluation metrics to justify the model's performance on fine-grained facial emotion and motion control. Experimental results demonstrate that: (1) InstructAvatar exhibits significant improvements in emotion control, lip-sync quality, and naturalness compared to previous baselines. |
| Researcher Affiliation | Academia | National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University |
| Pseudocode | No | The paper describes its methodology through textual descriptions, diagrams (Figure 2, Figure 3, Figure 4), and mathematical formulations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured, code-like procedural steps. |
| Open Source Code | No | The paper provides a 'Demo' link (https://wangyuchi369.github.io/InstructAvatar/) and an 'Extended version' link (https://arxiv.org/abs/2405.15758), but neither explicitly states that the source code for the methodology is being released or provides a direct link to a code repository. The 'Demo' link typically points to a project demonstration page, not necessarily a code repository. |
| Open Datasets | Yes | For emotional talking control, we augment the MEAD dataset (Wang et al. 2020) following the methods outlined in Sec. 3.2. ... For text-guided facial motion control, we leveraged the CC v1 dataset (Hazirbas et al. 2021), which offers paired data comprising instructions and corresponding action videos. To ensure effective lip synchronization, we also incorporated the HDTF dataset (Zhang et al. 2021), which has high-quality talking face recordings. The evaluation was conducted using MEAD for in-domain assessment and Talking Head 1KH (Wang, Mallya, and Liu 2021b) for out-of-domain evaluation. |
| Dataset Splits | Yes | For emotional talking control, we augment the MEAD dataset (Wang et al. 2020) following the methods outlined in Sec. 3.2. MEAD is a large-scale emotional talking face dataset featuring 8 emotion types and 3 intensity levels. We reserved 5 individuals for testing purposes and utilized the remaining data for training. |
| Hardware Specification | Yes | We adopt the Adam (Kingma and Ba 2014) optimizer and train our models on 8 V100 GPUs. |
| Software Dependencies | No | We use Conformer (Gulati et al. 2020) as the backbone of our diffusion-based motion generator. Specifically, the model comprises 12 Conformer blocks, with a hidden state size of 768. For encoding textual instructions, we apply CLIP-L/14 (Radford et al. 2021), and the Adapters are two-layer MLPs. While specific models/architectures like Conformer and CLIP-L/14 are mentioned, no specific version numbers for these software components or underlying frameworks (e.g., PyTorch, TensorFlow, Python) are provided. |
| Experiment Setup | Yes | We use Conformer (Gulati et al. 2020) as the backbone of our diffusion-based motion generator. Specifically, the model comprises 12 Conformer blocks, with a hidden state size of 768. For encoding textual instructions, we apply CLIP-L/14 (Radford et al. 2021), and the Adapters are two-layer MLPs. We adopt the Adam (Kingma and Ba 2014) optimizer and train our models on 8 V100 GPUs. |
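The hyperparameters reported in the Experiment Setup row can be collected into a configuration sketch. This is a minimal illustration only; the class name and field names are assumptions for readability, not the authors' code, which is not publicly released:

```python
# Hypothetical config mirroring the setup reported in the paper.
# All field names are illustrative; only the values come from the text.
from dataclasses import dataclass


@dataclass
class InstructAvatarConfig:
    # Diffusion-based motion generator backbone (Conformer, Gulati et al. 2020)
    num_conformer_blocks: int = 12
    hidden_size: int = 768
    # Text instruction encoder (CLIP-L/14, Radford et al. 2021)
    text_encoder: str = "CLIP-L/14"
    adapter_num_layers: int = 2  # Adapters are two-layer MLPs
    # Training setup (Adam optimizer, Kingma and Ba 2014; 8 V100 GPUs)
    optimizer: str = "Adam"
    num_gpus: int = 8
    gpu_type: str = "V100"


config = InstructAvatarConfig()
print(config.num_conformer_blocks, config.hidden_size)  # 12 768
```

Such a sketch makes the reproducibility gap concrete: architecture-level numbers are stated, but software versions, learning rate, batch size, and diffusion schedule would also be needed to reproduce training.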