Co³Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion

Authors: Xingqun Qi, Yatian Wang, Hengyuan Zhang, Jiahao Pan, Wei Xue, Shanghang Zhang, Wenhan Luo, Qifeng Liu, Yike Guo

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that our method outperforms the state-of-the-art models on our newly collected GES-Inter dataset. The dataset and source code are publicly available at https://mattie-e.github.io/Co3/.
Researcher Affiliation | Academia | The Hong Kong University of Science and Technology; Peking University
Pseudocode | No | The paper describes the methodology using prose and mathematical formulations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The dataset and source code are publicly available at https://mattie-e.github.io/Co3/.
Open Datasets | Yes | To fulfill this goal, we first construct a large-scale concurrent co-speech gesture dataset that contains more than 7M frames for diverse two-person interactive posture sequences, dubbed GES-Inter. ... The dataset and source code are publicly available at https://mattie-e.github.io/Co3/.
Dataset Splits | Yes | Finally, we acquire 27,390 motion clips that are split into training/validation/testing following criteria (Liu et al., 2022a; 2024a) as 85%/7.5%/7.5%.
Hardware Specification | Yes | The extraction takes 8 NVIDIA RTX 4090 GPUs in one month, obtaining 20 million raw frames. ... Our model is applied on a single NVIDIA H800 GPU with a batch size of 128.
Software Dependencies | No | The paper mentions tools used for data processing, such as pyannote-audio, Whisper-X, and the Montreal Forced Aligner (MFA), as well as an AdamW optimizer, but it does not provide specific version numbers for these or other key software libraries used in the implementation of the method.
Experiment Setup | Yes | We set the total generated sequence length N = 90 with the FPS normalized as 15 in the experiments. ... The dimension of input audio mel-spectrograms is 128 × 186. ... Each branch of our pipeline is implemented with 8 blocks, each containing 8-head attention layers. The latent dimension D is set to 768. In the training stage, we set λsimple = 15, empirically. The initial learning rate is set as 1 × 10^-4 with an AdamW optimizer. Similar to Nichol & Dhariwal (2021), we set the diffusion time step as 1,000 with the cosine noise schedule. ... During inference, we adopt the DDIM (Song et al., 2020) sampling strategy with 50 denoising timesteps to produce gestures. ... Our Co³Gesture synthesizes upper body movements containing 46 joints (i.e., 16 body joints + 30 hand joints) for each speaker. Each joint is converted to a 6D rotation representation (Zhou et al.) for more stable modeling. The dimension of the generated motion sequence is 90 × 276, where 90 denotes the frame number and 276 = 46 × 6 covers the upper-body joints.
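The experiment-setup row pins down the motion tensor dimensions exactly; a minimal sketch (assuming NumPy, and using only the figures quoted from the paper; everything else is illustrative) checks the arithmetic:

```python
import numpy as np

# Figures quoted from the paper's experiment setup.
N_FRAMES = 90        # generated sequence length N, at 15 FPS
BODY_JOINTS = 16     # upper-body joints per speaker
HAND_JOINTS = 30     # hand joints per speaker
ROT_DIM = 6          # 6D rotation representation (Zhou et al.)
T_DIFFUSION = 1000   # diffusion time steps used in training
DDIM_STEPS = 50      # denoising timesteps used at inference

n_joints = BODY_JOINTS + HAND_JOINTS  # 46 joints in total
feat_dim = n_joints * ROT_DIM         # 276 = 46 * 6

# One generated motion sequence has shape (90, 276).
motion = np.zeros((N_FRAMES, feat_dim))
print(motion.shape)               # (90, 276)

# With 50 DDIM steps over 1,000 training steps, the sampler
# visits every 20th diffusion timestep on average.
print(T_DIFFUSION // DDIM_STEPS)  # 20
```

This only verifies the dimensional bookkeeping reported in the paper, not the model itself.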