LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning

Authors: Zhe Li, Weihao Yuan, Yisheng He, Lingteng Qiu, Shenhao Zhu, Xiaodong Gu, Weichao Shen, Yuan Dong, Zilong Dong, Laurence Yang

ICLR 2025

Reproducibility Variable | Result | LLM Response

Research Type | Experimental
  Extensive experimental results on multiple datasets demonstrate substantial improvements over previous methods across all three tasks. Quote: "Extensive experiments conducted across various datasets demonstrate the effectiveness of our approach in text-to-motion generation, motion-text retrieval, and motion-to-text captioning, with significant improvements compared to previous state-of-the-art methods."

Researcher Affiliation | Collaboration
  1 Huazhong University of Science and Technology; 2 Alibaba Group; 3 Nanjing University

Pseudocode | No
  The paper describes its methods in prose and does not include any explicitly labeled pseudocode or algorithm blocks.

Open Source Code | No
  Project page: https://aigc3d.github.io/LaMP/. The link is a project page that neither states that the source code for the described methodology is available nor links directly to a code repository.

Open Datasets | Yes
  Quote: "We evaluate our model on HumanML3D (Guo et al., 2022a) and KIT-ML (Plappert et al., 2016) datasets."

Dataset Splits | Yes
  Quote: "we allocate 23,384 samples for training, 1,460 for validation, and 4,383 for testing within HumanML3D, and utilize 4,888 for training, 300 for validation, and 830 for testing in KIT-ML."

Hardware Specification | Yes
  Quote: "Our model is implemented on NVIDIA A100 GPU using PyTorch."

Software Dependencies | No
  Quote: "Our model is implemented on NVIDIA A100 GPU using PyTorch." This mentions PyTorch but does not specify a version number.

Experiment Setup | Yes
  Quote: "For the motion VQ-VAE, we employ resblocks for both the encoder and decoder, with a downscale factor of 4. The VQ consists of 6 quantization layers, where each layer's codebook contains 512 512-dimensional codes. The quantization dropout ratio p is set to 0.2. The masked transformer is composed of 6 transformer layers with causal attention masks, 6 heads, and a latent dimension of 384. The learning rate reaches 2e-4 after 2000 iterations with a linear warm-up schedule for the training of all models. During inference, we set the CFG scale of the masked transformer as 4 on HumanML3D, and 2 on KIT-ML. Meanwhile, K was set to 10 on both datasets."
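The experiment-setup values quoted above can be collected into a configuration sketch. Since no code is released, every field name below (e.g. VQVAEConfig, cfg_scale, num_quant_layers) is a hypothetical label of ours; only the numeric values come from the paper's quote.

```python
from dataclasses import dataclass

@dataclass
class VQVAEConfig:
    # Motion VQ-VAE settings reported in the paper (field names are ours).
    downscale_factor: int = 4      # encoder/decoder downscale factor
    num_quant_layers: int = 6      # residual quantization layers
    codebook_size: int = 512       # codes per layer
    code_dim: int = 512            # dimensionality of each code
    quant_dropout_p: float = 0.2   # quantization dropout ratio p

@dataclass
class MaskedTransformerConfig:
    # Masked transformer with causal attention masks.
    num_layers: int = 6
    num_heads: int = 6
    latent_dim: int = 384
    peak_lr: float = 2e-4          # reached after warm-up
    warmup_iters: int = 2000       # linear warm-up schedule

# Dataset-specific inference settings: CFG scale differs per dataset,
# K = 10 on both.
INFERENCE = {
    "HumanML3D": {"cfg_scale": 4, "K": 10},
    "KIT-ML": {"cfg_scale": 2, "K": 10},
}
```

This is only a bookkeeping sketch of the reported hyperparameters, not the authors' implementation.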