LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning
Authors: Zhe Li, Weihao Yuan, Yisheng He, Lingteng Qiu, Shenhao Zhu, Xiaodong Gu, Weichao Shen, Yuan Dong, Zilong Dong, Laurence Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conducted across various datasets demonstrate the effectiveness of our approach in text-to-motion generation, motion-text retrieval, and motion-to-text captioning, with significant improvements compared to previous state-of-the-art methods. |
| Researcher Affiliation | Collaboration | 1 Huazhong University of Science and Technology 2 Alibaba Group 3 Nanjing University |
| Pseudocode | No | The paper describes the methods in prose and does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Project Page: https://aigc3d.github.io/LaMP/ The provided link is a project page, which does not explicitly state that the source code for the methodology described in this paper is available, nor does it link directly to a code repository. |
| Open Datasets | Yes | We evaluate our model on HumanML3D (Guo et al., 2022a) and KIT-ML (Plappert et al., 2016) datasets. |
| Dataset Splits | Yes | we allocate 23,384 samples for training, 1,460 for validation, and 4,383 for testing within HumanML3D, and utilize 4,888 for training, 300 for validation, and 830 for testing in KIT-ML. |
| Hardware Specification | Yes | Our model is implemented on an NVIDIA A100 GPU using PyTorch. |
| Software Dependencies | No | Our model is implemented on NVIDIA A100 GPU using PyTorch. This mentions PyTorch but does not specify a version number. |
| Experiment Setup | Yes | For the motion VQ-VAE, we employ resblocks for both the encoder and decoder, with a downscale factor of 4. The VQ consists of 6 quantization layers, where each layer's codebook contains 512 512-dimensional codes. The quantization dropout ratio p is set to 0.2. The masked transformer is composed of 6 transformer layers with causal attention masks, 6 heads, and a latent dimension of 384. The learning rate reaches 2e-4 after 2000 iterations with a linear warm-up schedule for the training of all models. During inference, we set the CFG scale of the masked transformer as 4 on HumanML3D, and 2 on KIT-ML. Meanwhile, K was set to 10 on both datasets. |
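The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The values below come from the paper's text; the field names, the `LaMPConfig` dataclass, and the `warmup_lr` helper are illustrative assumptions, not the authors' actual code.

```python
from dataclasses import dataclass

@dataclass
class LaMPConfig:
    # Motion VQ-VAE (values from the paper; field names are assumptions)
    downscale_factor: int = 4
    num_quant_layers: int = 6       # residual quantization layers
    codebook_size: int = 512        # codes per layer
    code_dim: int = 512             # dimension of each code
    quant_dropout_p: float = 0.2
    # Masked transformer
    num_layers: int = 6             # transformer layers with causal masks
    num_heads: int = 6
    latent_dim: int = 384
    # Optimization
    peak_lr: float = 2e-4
    warmup_iters: int = 2000
    # Inference
    cfg_scale_humanml3d: float = 4.0
    cfg_scale_kitml: float = 2.0
    top_k: int = 10                 # K from the paper

def warmup_lr(step: int, cfg: LaMPConfig) -> float:
    """Linear warm-up: LR rises from 0 to peak_lr over warmup_iters steps,
    then stays at peak_lr (a common reading of 'reaches 2e-4 after 2000
    iterations with a linear warm-up schedule')."""
    return cfg.peak_lr * min(step / cfg.warmup_iters, 1.0)
```

For example, at iteration 1000 the sketch yields half the peak learning rate (1e-4), and from iteration 2000 onward it stays at 2e-4; the paper does not specify a post-warm-up decay, so none is assumed here.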