ReMoGPT: Part-Level Retrieval-Augmented Motion-Language Models

Authors: Qing Yu, Mikihiro Tanaka, Kent Fujiwara

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments validate the efficacy of ReMoGPT, showcasing its superiority over existing state-of-the-art methods. The framework performs well on multiple motion tasks, including motion retrieval, generation, and captioning. We evaluate the efficacy of ReMoGPT on two standard motion generation benchmarks, namely HumanML3D (Guo et al. 2022a) and Motion-X (Lin et al. 2023).
Researcher Affiliation | Industry | LY Corporation EMAIL
Pseudocode | No | The paper illustrates prompt examples in Figure 5, which show 'Input', 'Context', and 'Output' structures, but it does not contain any formally labeled 'Pseudocode' or 'Algorithm' block with structured steps for a method or procedure.
Open Source Code | No | The paper states, "we implement our method based on the code of MotionGPT (Jiang et al. 2023)." This indicates reliance on third-party code but does not provide an explicit statement or link for the open-sourcing of ReMoGPT's own implementation.
Open Datasets | Yes | We use two text-to-motion datasets: HumanML3D (Guo et al. 2022a) and Motion-X (Lin et al. 2023) in the experiments. HumanML3D is a dataset that includes 14,616 motion clips sourced from AMASS (Mahmood et al. 2019), along with 44,970 sequence-level textual descriptions.
Dataset Splits | Yes | We evaluate the efficacy of ReMoGPT on two standard motion generation benchmarks, namely HumanML3D (Guo et al. 2022a) and Motion-X (Lin et al. 2023) in the experiments. ... The results of text-to-motion generation on HumanML3D are shown in Table 2... The results of rare motion generation... specifically for the top 5% of rare motions. ... the entire test set of HumanML3D.
Hardware Specification | Yes | All models are trained on 8 Tesla A100 GPUs.
Software Dependencies | No | The paper mentions using T5 (Raffel et al. 2020) as the base language model and the AdamW optimizer, and that the motion encoder uses a Transformer architecture, but it does not specify version numbers for any software libraries or programming languages used for implementation.
Experiment Setup | Yes | For the text-motion retrieval model, ... the dimension of each part-level motion embedding is set to 512. For the motion-language model, ... The size of the codebook for the motion tokenizer is set as 512 and the temporal downsampling rate l is set as 4 in the motion encoder. ... The AdamW optimizer is used in all the models for training. ... we further train the model with a learning rate of 10^-4 and a mini-batch size of 16 for 200 epochs.
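The hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object for anyone attempting to reproduce the training run. This is a minimal sketch only: the field names are illustrative assumptions, since the paper does not release its implementation.

```python
from dataclasses import dataclass


# Hypothetical configuration summarizing the hyperparameters reported in the
# paper. Field names are our own; they do not come from the authors' code.
@dataclass(frozen=True)
class ReMoGPTTrainConfig:
    part_embedding_dim: int = 512    # dimension of each part-level motion embedding
    codebook_size: int = 512         # motion tokenizer codebook size
    temporal_downsample: int = 4     # temporal downsampling rate l in the motion encoder
    optimizer: str = "AdamW"         # used for all models
    learning_rate: float = 1e-4
    batch_size: int = 16
    epochs: int = 200


cfg = ReMoGPTTrainConfig()
```

A frozen dataclass is used so the reported values cannot be mutated accidentally when threading the config through a reproduction script.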