HGM³: Hierarchical Generative Masked Motion Modeling with Hard Token Mining

Authors: Minjae Jeong, Yechan Hwang, Jaejin Lee, Sungyoon Jung, Won Hwa Kim

ICLR 2025

Reproducibility assessment (variable, result, supporting quote):
Research Type: Experimental. "Extensive experiments on benchmark datasets, including HumanML3D and KIT-ML, demonstrate that our method outperforms existing methods in both qualitative and quantitative measures for generating context-aware motions."
Researcher Affiliation: Academia. Pohang University of Science and Technology (POSTECH), Pohang, South Korea.
Pseudocode: No. The paper describes the methods in detailed paragraphs and uses figures (Fig. 1, Fig. 2, Appendix Fig. 1, Appendix Fig. 2, Appendix Fig. 3) to illustrate processes, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: No. "To ensure the reproducibility of our work, we present a detailed illustration of the training and inference process of our Hierarchical Generative Masked Motion Model with HTM in Fig. 1 and Fig. 2. Furthermore, we provide the implementation details for HGM³ and the experimental results replicated 20 times and the average with a 95% confidence interval on HumanML3D and KIT-ML datasets. We will release the full code and setup to facilitate reproducibility of our work."
Open Datasets: Yes. "We evaluate HGM³ on two widely-used text-to-motion datasets: HumanML3D (Guo et al., 2022a) and KIT Motion-Language (KIT-ML) (Plappert et al., 2016)."
Dataset Splits: Yes. "Both datasets are split into training, validation, and test sets with proportions of 80%, 5%, and 15%, respectively."
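The 80/5/15 split reported above can be sketched as a simple index partition. This is an illustrative sketch only: the seed, shuffling, and `split_dataset` helper are assumptions, not the authors' code, and the sample count is hypothetical.

```python
import random

def split_dataset(n_samples, seed=0):
    """Partition sample indices into train/val/test with the reported
    80%/5%/15% proportions. Shuffling and the seed are assumptions."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = int(0.80 * n_samples)
    n_val = int(0.05 * n_samples)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]  # remaining 15%
    return train, val, test

train, val, test = split_dataset(1000)
# sizes: 800 train, 50 val, 150 test
```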
Hardware Specification: Yes. "All our experiments are conducted on a single NVIDIA RTX 6000 Ada Generation GPU."
Software Dependencies: No. The paper mentions using a pre-trained CLIP model (ViT-B-32 CLIP) and an AdamW optimizer, but it does not specify any programming languages, libraries (e.g., PyTorch, TensorFlow), or other software with their specific version numbers.
Experiment Setup: Yes. "The residual VQ-VAE contains 6 quantization layers, each with a codebook of 512 codes of 512 dimensions. The downsampling rate N/n of the VAE encoder is set to 4 and the latent dimension is set to 384. β in L_rvq is set to 0.02. We use the ViT-B-32 CLIP model, where the dimension of the text representation is set to 512. All transformers consist of 6 transformer layers with 6 attention heads. For the HTM implementation, α0 and αT are set to 0 and 0.5, respectively. Our models are trained using the AdamW optimizer for 500 epochs. The learning rate is linearly warmed up to 2e-4 over 2000 iterations. The batch size is set to 512 for training the residual VQ-VAE, and 256 for training the masked transformer and the transformer predicting the reconstruction loss. For training the residual transformer, the batch size is set to 64 for HumanML3D and 32 for KIT-ML. During the inference process, LM, LA, and L are set to 2, 5, and 10, respectively."
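For convenience, the hyperparameters quoted in the experiment-setup row can be collected into a single configuration sketch. The dict layout and key names below are illustrative assumptions (the authors have not released code); only the numeric values come from the paper.

```python
# Hedged sketch: hyperparameters reported in the paper, gathered into one
# config dict. Key names are hypothetical, not from the authors' codebase.
HGM3_CONFIG = {
    "residual_vqvae": {
        "num_quantization_layers": 6,
        "codebook_size": 512,     # 512 codes per codebook
        "code_dim": 512,          # 512-dimensional codes
        "downsampling_rate": 4,   # N/n of the VAE encoder
        "latent_dim": 384,
        "beta": 0.02,             # β in L_rvq
        "batch_size": 512,
    },
    "text_encoder": {
        "model": "ViT-B-32 CLIP",
        "text_dim": 512,
    },
    "transformers": {
        "num_layers": 6,
        "num_heads": 6,
        "batch_size_masked_and_loss_predictor": 256,
        "batch_size_residual": {"HumanML3D": 64, "KIT-ML": 32},
    },
    "htm": {
        "alpha_0": 0.0,           # α0
        "alpha_T": 0.5,           # αT
    },
    "optimization": {
        "optimizer": "AdamW",
        "epochs": 500,
        "peak_lr": 2e-4,          # linear warmup over 2000 iterations
        "warmup_iters": 2000,
    },
    "inference": {
        "L_M": 2,
        "L_A": 5,
        "L": 10,
    },
}
```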