HGM³: Hierarchical Generative Masked Motion Modeling with Hard Token Mining
Authors: Minjae Jeong, Yechan Hwang, Jaejin Lee, Sungyoon Jung, Won Hwa Kim
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on benchmark datasets, including HumanML3D and KIT-ML, demonstrate that our method outperforms existing methods in both qualitative and quantitative measures for generating context-aware motions. |
| Researcher Affiliation | Academia | Pohang University of Science and Technology (POSTECH), Pohang, South Korea EMAIL |
| Pseudocode | No | The paper describes the methods in detailed paragraphs and uses figures (Fig. 1, Fig. 2, Appendix Fig. 1, Appendix Fig. 2, Appendix Fig. 3) to illustrate processes, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | To ensure the reproducibility of our work, we present a detailed illustration of the training and inference process of our Hierarchical Generative Masked Motion Model with HTM in Fig. 1 and Fig. 2. Furthermore, we provide the implementation details for HGM³ and the experimental results replicated 20 times and the average with a 95% confidence interval on HumanML3D and KIT-ML datasets. We will release the full code and setup to facilitate reproducibility of our work. |
| Open Datasets | Yes | We evaluate HGM³ on two widely-used text-to-motion datasets: HumanML3D (Guo et al., 2022a) and KIT Motion-Language (KIT-ML) (Plappert et al., 2016). |
| Dataset Splits | Yes | Both datasets are split into training, validation, and test sets with proportions of 80%, 5%, and 15%, respectively. |
| Hardware Specification | Yes | All our experiments are conducted on a single NVIDIA RTX 6000 Ada Generation GPU. |
| Software Dependencies | No | The paper mentions using a pre-trained CLIP model (ViT-B-32 CLIP) and an AdamW optimizer, but it does not specify any programming languages, libraries (e.g., PyTorch, TensorFlow), or other software with their specific version numbers. |
| Experiment Setup | Yes | The residual VQ-VAE contains 6 quantization layers, each with a codebook of 512 codes of 512 dimensions. The downsampling rate N/n of the VAE encoder is set to 4 and the latent dimension is set to 384. β in L_rvq is set to 0.02. We use the ViT-B-32 CLIP model, where the dimension of the text representation is set to 512. All transformers consist of 6 transformer layers with 6 attention heads. For the HTM implementation, α0 and αT are set to 0 and 0.5, respectively. Our models are trained using the AdamW optimizer for 500 epochs. The learning rate is linearly warmed up to 2e-4 over 2000 iterations. The batch size is set to 512 for training the residual VQ-VAE, and 256 for training the masked transformer and the transformer predicting the reconstruction loss. For training the residual transformer, the batch size is set to 64 for HumanML3D and 32 for KIT-ML. During the inference process, L_M, L_A, and L are set to 2, 5, and 10, respectively. |
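Since no code has been released, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration sketch for anyone attempting a reimplementation. All key names below are illustrative assumptions, not identifiers from the authors' code; only the numeric values come from the paper.

```python
# Hypothetical reproduction config for HGM³, assembled from the paper's
# Experiment Setup description. Key names are our own invention.

rvq_vae = {
    "num_quantization_layers": 6,
    "codebook_size": 512,        # 512 codes per codebook
    "code_dim": 512,             # each code is 512-dimensional
    "downsampling_rate": 4,      # N/n of the VAE encoder
    "latent_dim": 384,
    "beta_rvq": 0.02,            # β in L_rvq
}

transformers = {
    "text_encoder": "ViT-B-32 CLIP",
    "text_dim": 512,             # CLIP text representation dimension
    "num_layers": 6,
    "num_heads": 6,
}

htm = {"alpha_0": 0.0, "alpha_T": 0.5}  # HTM schedule endpoints

training = {
    "optimizer": "AdamW",
    "epochs": 500,
    "peak_lr": 2e-4,             # linear warmup to this rate
    "warmup_iters": 2000,
    "batch_size": {
        "residual_vqvae": 512,
        "masked_transformer": 256,
        "loss_predictor": 256,   # transformer predicting reconstruction loss
        "residual_transformer": {"HumanML3D": 64, "KIT-ML": 32},
    },
}

inference = {"L_M": 2, "L_A": 5, "L": 10}
```

A config like this makes it easy to diff a reimplementation against the paper's stated setup, which is useful given that the paper reports no software versions.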