Co³Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion
Authors: Xingqun Qi, Yatian Wang, Hengyuan Zhang, Jiahao Pan, Wei Xue, Shanghang Zhang, Wenhan Luo, Qifeng Liu, Yike Guo
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our method outperforms the state-of-the-art models on our newly collected GES-Inter dataset. The dataset and source code are publicly available at https://mattie-e.github.io/Co3/. |
| Researcher Affiliation | Academia | ¹The Hong Kong University of Science and Technology, ²Peking University |
| Pseudocode | No | The paper describes the methodology using prose and mathematical formulations but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The dataset and source code are publicly available at https://mattie-e.github.io/Co3/. |
| Open Datasets | Yes | To fulfill this goal, we first construct a large-scale concurrent co-speech gesture dataset that contains more than 7M frames for diverse two-person interactive posture sequences, dubbed GES-Inter. ... The dataset and source code are publicly available at https://mattie-e.github.io/Co3/. |
| Dataset Splits | Yes | Finally, we acquire 27,390 motion clips that are split into training/validation/testing following criteria (Liu et al., 2022a; 2024a) as 85%/7.5%/7.5%. |
| Hardware Specification | Yes | The extraction takes 8 NVIDIA RTX 4090 GPUs in one month, obtaining 20 million raw frames. ... Our model is applied on a single NVIDIA H800 GPU with a batch size of 128. |
| Software Dependencies | No | The paper mentions tools like pyannote-audio, Whisper-X, and Montreal Forced Aligner (MFA) used for data processing, and an AdamW optimizer, but it does not provide specific version numbers for these or other key software libraries used in the implementation of their method. |
| Experiment Setup | Yes | We set the total generated sequence length N = 90 with the FPS normalized as 15 in the experiments. ... The dimension of input audio mel-spectrograms is 128 × 186. ... Each branch of our pipeline is implemented with 8 blocks within 8 heads of attention layers. The latent dimension D is set to 768. In the training stage, we set λ_simple = 15, empirically. The initial learning rate is set as 1 × 10^−4 with an AdamW optimizer. Similar to Nichol & Dhariwal (2021), we set the diffusion time step as 1,000 with the cosine noise schedule. ... During inference, we adopt the DDIM (Song et al., 2020) sampling strategy with 50 denoising timesteps to produce gestures. ... Our Co3Gesture synthesizes upper body movements containing 46 joints (i.e., 16 body joints + 30 hand joints) of each speaker. Each joint is converted to a 6D rotation representation (Zhou et al.) for more stable modeling. The dimension of the generated motion sequence is R^{90×276}, where 90 denotes the frame number and 276 = 46 × 6 covers the upper body joints. |
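The experiment setup above can be sanity-checked with a short sketch: the cosine noise schedule below follows the standard Nichol & Dhariwal (2021) formulation the paper cites (the exact implementation used by the authors is not shown in this report and may differ), and the motion-tensor shape reproduces the reported 90 × 276 layout. Variable names (`cosine_alpha_bar`, `motion`) are illustrative, not from the paper's codebase.

```python
import numpy as np

def cosine_alpha_bar(T=1000, s=0.008):
    """Cosine noise schedule (Nichol & Dhariwal, 2021):
    alpha_bar(t) = cos^2(((t/T + s)/(1 + s)) * pi/2), normalized so alpha_bar(0) = 1."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    # Per-step betas, clipped as in the reference implementation.
    betas = np.clip(1.0 - alpha_bar[1:] / alpha_bar[:-1], 0.0, 0.999)
    return alpha_bar[1:], betas

# T = 1,000 diffusion steps, as reported in the paper.
alpha_bar, betas = cosine_alpha_bar(T=1000)

# Reported motion tensor: N = 90 frames, 46 joints x 6D rotation = 276 features.
N, J, D6 = 90, 46, 6
motion = np.zeros((N, J * D6))  # one speaker's gesture sequence, shape (90, 276)
```

The schedule is monotonically noisier (`alpha_bar` decreases toward 0), which is what makes the cosine variant gentler than a linear schedule at early timesteps.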