Co³Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion

Authors: Xingqun Qi, Yatian Wang, Hengyuan Zhang, Jiahao Pan, Wei Xue, Shanghang Zhang, Wenhan Luo, Qifeng Liu, Yike Guo

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our method outperforms the state-of-the-art models on our newly collected GES-Inter dataset. The dataset and source code are publicly available at https://mattie-e.github.io/Co3/.
Researcher Affiliation | Academia | The Hong Kong University of Science and Technology; Peking University
Pseudocode | No | The paper describes the methodology using prose and mathematical formulations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The dataset and source code are publicly available at https://mattie-e.github.io/Co3/.
Open Datasets | Yes | To fulfill this goal, we first construct a large-scale concurrent co-speech gesture dataset that contains more than 7M frames for diverse two-person interactive posture sequences, dubbed GES-Inter. ... The dataset and source code are publicly available at https://mattie-e.github.io/Co3/.
Dataset Splits | Yes | Finally, we acquire 27,390 motion clips that are split into training/validation/testing following criteria (Liu et al., 2022a; 2024a) as 85%/7.5%/7.5%.
Hardware Specification | Yes | The extraction takes 8 NVIDIA RTX 4090 GPUs in one month, obtaining 20 million raw frames. ... Our model is applied on a single NVIDIA H800 GPU with a batch size of 128.
Software Dependencies | No | The paper mentions tools used for data processing, such as pyannote-audio, Whisper-X, and the Montreal Forced Aligner (MFA), as well as an AdamW optimizer, but it does not provide specific version numbers for these or other key software libraries used in the implementation of the method.
Experiment Setup | Yes | We set the total generated sequence length N = 90 with the FPS normalized as 15 in the experiments. ... The dimension of input audio mel-spectrograms is 128 × 186. ... Each branch of our pipeline is implemented with 8 blocks, each containing 8-head attention layers. The latent dimension D is set to 768. In the training stage, we set λsimple = 15, empirically. The initial learning rate is set as 1 × 10^-4 with an AdamW optimizer. Similar to Nichol & Dhariwal (2021), we set the diffusion time step as 1,000 with the cosine noise schedule. ... During inference, we adopt the DDIM (Song et al., 2020) sampling strategy with 50 denoising timesteps to produce gestures. ... Our Co³Gesture synthesizes upper body movements containing 46 joints (i.e., 16 body joints + 30 hand joints) for each speaker. Each joint is converted to a 6D rotation representation (Zhou et al.) for more stable modeling. The dimension of the generated motion sequence is 90 × 276, where 90 denotes the frame number and 276 = 46 × 6 covers the upper-body joints.
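The experiment-setup row pins down the motion tensor dimensions exactly; a minimal sketch (assuming NumPy, and using only the figures quoted from the paper; everything else is illustrative) checks the arithmetic:

```python
import numpy as np

# Figures quoted from the paper's experiment setup.
N_FRAMES = 90        # generated sequence length N, at 15 FPS
BODY_JOINTS = 16     # upper-body joints per speaker
HAND_JOINTS = 30     # hand joints per speaker
ROT_DIM = 6          # 6D rotation representation (Zhou et al.)
T_DIFFUSION = 1000   # diffusion time steps used in training
DDIM_STEPS = 50      # denoising timesteps used at inference

n_joints = BODY_JOINTS + HAND_JOINTS  # 46 joints in total
feat_dim = n_joints * ROT_DIM         # 276 = 46 * 6

# One generated motion sequence has shape (90, 276).
motion = np.zeros((N_FRAMES, feat_dim))
print(motion.shape)               # (90, 276)

# With 50 DDIM steps over 1,000 training steps, the sampler
# visits every 20th diffusion timestep on average.
print(T_DIFFUSION // DDIM_STEPS)  # 20
```

This only verifies the dimensional bookkeeping reported in the paper, not the model itself.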