DIDiffGes: Decoupled Semi-Implicit Diffusion Models for Real-time Gesture Generation from Speech
Authors: Yongkang Cheng, Shaoli Huang, Xuelin Chen, Jifeng Ning, Mingming Gong
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that our method can generate high-quality, realistic, and natural gesture motions with fewer sampling steps. Our user study also confirms the superior quality of our rapidly generated results compared to methods requiring more sampling steps. Data and Representation. Our experiments employed three distinct high-quality 3D motion capture datasets: BEATs (Liu et al. 2022a), ZeroEGGS (Ghorbani et al. 2023), and AIST++ (Li et al. 2021). Quantitative Comparison. It is well-known that evaluating a model's generative capability based solely on a limited number of generated examples is challenging; therefore, we introduce several metrics. (i) Fréchet Gesture Distance (FGD)... (ii) We compute the number of frames generated per second... (iii) We also compared the Beats Alignment (BA) and Diversity (DIV)... Ablation Studies. In this section, we investigated the impact of diffusion steps, reconstruction loss weight, and auxiliary forward loss on model performance. All ablation studies were conducted on the ZeroEGGS dataset, and the results are presented in Table 3. |
| Researcher Affiliation | Collaboration | Yongkang Cheng1, 3, Shaoli Huang1*, Xuelin Chen1, Jifeng Ning3, Mingming Gong2,4 1Tencent AI Lab 2School of Mathematics and Statistics, The University of Melbourne 3College of Information Engineering, Northwest A&F University 4Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates |
| Pseudocode | No | The paper describes the methodology using text and diagrams (Figure 2) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it include a link to a code repository. It only mentions using 'open-source code' for other methods. |
| Open Datasets | Yes | Our experiments employed three distinct high-quality 3D motion capture datasets: BEATs (Liu et al. 2022a), ZeroEGGS (Ghorbani et al. 2023), and AIST++ (Li et al. 2021). |
| Dataset Splits | Yes | We split the original dataset into training, validation, and test sets with proportions of 0.8, 0.1, and 0.1, respectively, and trained on the entire training set. |
| Hardware Specification | Yes | In our implementation, using a V100 GPU, we generated 80 frames of concurrent gestures in 0.4 seconds and 150 frames of dance motions in 0.88 seconds. For a fair comparison with contemporary methods, we conducted experiments on a single V100 GPU. By default, we utilized a single A100 GPU for model training. |
| Software Dependencies | No | The paper mentions that the framework is 'implemented exclusively using Pytorch' and that they developed 'a script for Blender (Community 2018)', but it does not specify the version of PyTorch or other key software dependencies required for replication. |
| Experiment Setup | Yes | For the concurrent gesture model, we trained the generator and discriminator for 80 hours using a batch size of 128 and learning rates of 3e-5 and 1.25e-4, respectively. The dance model was trained for 48 hours with a batch size of 128 and learning rates of 5e-5 and 1.5e-4. We set the default diffusion steps to 20... in CFG (Ho and Salimans 2022), we set the conditional weight to 3.5. In the denoising model, we set the weights of the KL loss and Geo Loss to 0.5 and 10, respectively. These hyperparameters were found to yield the best empirical results. |
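The paper reports a Fréchet Gesture Distance (FGD) without implementation details. FGD follows the same statistic as FID: fit a Gaussian to feature embeddings of real and generated motion, then take the Fréchet distance between the two Gaussians. A minimal sketch of that statistic (the feature extractor itself is paper-specific and not shown here):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussian fits of two feature sets
    (the FID/FGD-style statistic); inputs are (n_samples, dim) arrays
    of motion-feature embeddings."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    diff = mu1 - mu2
    # Matrix square root of the covariance product; may pick up a tiny
    # imaginary component from numerical error, so keep the real part.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```

Identical feature sets score near zero; the score grows as the two distributions drift apart, which is why lower FGD is better in the paper's tables.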
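The reported split (0.8/0.1/0.1, trained on the entire training set) can be reproduced with a simple shuffled index split; the seed and shuffling procedure below are assumptions, since the paper does not state them:

```python
import numpy as np

def split_dataset(n_samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle sample indices and split them into train/val/test
    subsets by the given ratios (paper uses 0.8/0.1/0.1)."""
    rng = np.random.default_rng(seed)  # seed is an assumption
    idx = rng.permutation(n_samples)
    n_train = int(ratios[0] * n_samples)
    n_val = int(ratios[1] * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```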
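The setup row cites classifier-free guidance (CFG, Ho and Salimans 2022) with a conditional weight of 3.5. The standard CFG combination of the conditional and unconditional noise predictions, with that weight as the default, can be sketched as:

```python
import numpy as np

def cfg_denoise(eps_cond, eps_uncond, w=3.5):
    """Classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the conditional one with weight w
    (the paper sets w = 3.5)."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

At each sampling step the denoiser is run twice (with and without the speech condition) and the two predictions are blended this way before the update; w = 1 recovers purely conditional sampling.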