HGM³: Hierarchical Generative Masked Motion Modeling with Hard Token Mining

Authors: Minjae Jeong, Yechan Hwang, Jaejin Lee, Sungyoon Jung, Won Hwa Kim

ICLR 2025

Reproducibility assessment (variable, result, supporting quote):
Research Type: Experimental. "Extensive experiments on benchmark datasets, including HumanML3D and KIT-ML, demonstrate that our method outperforms existing methods in both qualitative and quantitative measures for generating context-aware motions."
Researcher Affiliation: Academia. Pohang University of Science and Technology (POSTECH), Pohang, South Korea.
Pseudocode: No. The paper describes the methods in detailed paragraphs and uses figures (Fig. 1, Fig. 2, Appendix Fig. 1, Appendix Fig. 2, Appendix Fig. 3) to illustrate processes, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: No. "To ensure the reproducibility of our work, we present a detailed illustration of the training and inference process of our Hierarchical Generative Masked Motion Model with HTM in Fig. 1 and Fig. 2. Furthermore, we provide the implementation details for HGM³ and the experimental results replicated 20 times and the average with a 95% confidence interval on HumanML3D and KIT-ML datasets. We will release the full code and setup to facilitate reproducibility of our work."
Open Datasets: Yes. "We evaluate HGM³ on two widely-used text-to-motion datasets: HumanML3D (Guo et al., 2022a) and KIT Motion-Language (KIT-ML) (Plappert et al., 2016)."
Dataset Splits: Yes. "Both datasets are split into training, validation, and test sets with proportions of 80%, 5%, and 15%, respectively."
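The 80/5/15 split reported above can be sketched as a simple index partition. This is an illustrative sketch only: the seed, shuffling, and `split_dataset` helper are assumptions, not the authors' code, and the sample count is hypothetical.

```python
import random

def split_dataset(n_samples, seed=0):
    """Partition sample indices into train/val/test with the reported
    80%/5%/15% proportions. Shuffling and the seed are assumptions."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = int(0.80 * n_samples)
    n_val = int(0.05 * n_samples)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]  # remaining 15%
    return train, val, test

train, val, test = split_dataset(1000)
# sizes: 800 train, 50 val, 150 test
```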
Hardware Specification: Yes. "All our experiments are conducted on a single NVIDIA RTX 6000 Ada Generation GPU."
Software Dependencies: No. The paper mentions using a pre-trained CLIP model (ViT-B-32 CLIP) and an AdamW optimizer, but it does not specify any programming languages, libraries (e.g., PyTorch, TensorFlow), or other software with their specific version numbers.
Experiment Setup: Yes. "The residual VQ-VAE contains 6 quantization layers, each with a codebook of 512 codes of 512 dimensions. The downsampling rate N/n of the VAE encoder is set to 4 and the latent dimension is set to 384. β in L_rvq is set to 0.02. We use the ViT-B-32 CLIP model, where the dimension of the text representation is set to 512. All transformers consist of 6 transformer layers with 6 attention heads. For the HTM implementation, α0 and αT are set to 0 and 0.5, respectively. Our models are trained using the AdamW optimizer for 500 epochs. The learning rate is linearly warmed up to 2e-4 over 2000 iterations. The batch size is set to 512 for training the residual VQ-VAE, and 256 for training the masked transformer and the transformer predicting the reconstruction loss. For training the residual transformer, the batch size is set to 64 for HumanML3D and 32 for KIT-ML. During the inference process, LM, LA, and L are set to 2, 5, and 10, respectively."
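For convenience, the hyperparameters quoted in the experiment-setup row can be collected into a single configuration sketch. The dict layout and key names below are illustrative assumptions (the authors have not released code); only the numeric values come from the paper.

```python
# Hedged sketch: hyperparameters reported in the paper, gathered into one
# config dict. Key names are hypothetical, not from the authors' codebase.
HGM3_CONFIG = {
    "residual_vqvae": {
        "num_quantization_layers": 6,
        "codebook_size": 512,     # 512 codes per codebook
        "code_dim": 512,          # 512-dimensional codes
        "downsampling_rate": 4,   # N/n of the VAE encoder
        "latent_dim": 384,
        "beta": 0.02,             # β in L_rvq
        "batch_size": 512,
    },
    "text_encoder": {
        "model": "ViT-B-32 CLIP",
        "text_dim": 512,
    },
    "transformers": {
        "num_layers": 6,
        "num_heads": 6,
        "batch_size_masked_and_loss_predictor": 256,
        "batch_size_residual": {"HumanML3D": 64, "KIT-ML": 32},
    },
    "htm": {
        "alpha_0": 0.0,           # α0
        "alpha_T": 0.5,           # αT
    },
    "optimization": {
        "optimizer": "AdamW",
        "epochs": 500,
        "peak_lr": 2e-4,          # linear warmup over 2000 iterations
        "warmup_iters": 2000,
    },
    "inference": {
        "L_M": 2,
        "L_A": 5,
        "L": 10,
    },
}
```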