ReMoGPT: Part-Level Retrieval-Augmented Motion-Language Models
Authors: Qing Yu, Mikihiro Tanaka, Kent Fujiwara
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments validate the efficacy of ReMoGPT, showcasing its superiority over existing state-of-the-art methods. The framework performs well on multiple motion tasks, including motion retrieval, generation, and captioning. We evaluate the efficacy of ReMoGPT on two standard motion generation benchmarks, namely HumanML3D (Guo et al. 2022a) and Motion-X (Lin et al. 2023). |
| Researcher Affiliation | Industry | LY Corporation |
| Pseudocode | No | The paper illustrates prompt examples in Figure 5, which show 'Input', 'Context', and 'Output' structures, but it does not contain any formally labeled 'Pseudocode' or 'Algorithm' block with structured steps for a method or procedure. |
| Open Source Code | No | The paper states, "we implement our method based on the code of MotionGPT (Jiang et al. 2023)." This indicates reliance on third-party code but does not provide an explicit statement or link for the open-sourcing of ReMoGPT's own implementation. |
| Open Datasets | Yes | We use two text-to-motion datasets: HumanML3D (Guo et al. 2022a) and Motion-X (Lin et al. 2023) in the experiments. HumanML3D is a dataset that includes 14,616 motion clips sourced from AMASS (Mahmood et al. 2019), along with 44,970 sequence-level textual descriptions. |
| Dataset Splits | Yes | We evaluate the efficacy of ReMoGPT on two standard motion generation benchmarks, namely HumanML3D (Guo et al. 2022a) and Motion-X (Lin et al. 2023) in the experiments. ... The results of text-to-motion generation on HumanML3D are shown in Table 2... The results of rare motion generation... specifically for the top 5% of rare motions. ... the entire test set of HumanML3D. |
| Hardware Specification | Yes | All models are trained on 8 Tesla A100 GPUs. |
| Software Dependencies | No | The paper mentions using T5 (Raffel et al. 2020) as the base language model and the AdamW optimizer, and that the motion encoder uses a Transformer architecture, but it does not specify version numbers for any software libraries or programming languages used for implementation. |
| Experiment Setup | Yes | For the text-motion retrieval model, ... the dimension of each part-level motion embedding is set to 512. For the motion-language model, ... The size of the codebook for the motion tokenizer is set as 512 and the temporal downsampling rate l is set as 4 in the motion encoder. ... The AdamW optimizer is used in all the models for training. ... we further train the model with a learning rate of 10^-4 and a mini-batch size of 16 for 200 epochs. |
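The hyperparameters quoted in the Experiment Setup row can be collected into a minimal configuration sketch. This is illustrative only: the field names and the `ReMoGPTConfig` class are assumptions for readability, while the values (embedding dimension, codebook size, downsampling rate, optimizer, learning rate, batch size, epochs) are taken from the paper's own excerpt.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReMoGPTConfig:
    # Hypothetical config class; field names are illustrative, values quoted from the paper.
    # Text-motion retrieval model: dimension of each part-level motion embedding.
    part_embedding_dim: int = 512
    # Motion tokenizer: codebook size and temporal downsampling rate l.
    codebook_size: int = 512
    temporal_downsample: int = 4
    # Optimization settings ("AdamW optimizer is used in all the models").
    optimizer: str = "AdamW"
    learning_rate: float = 1e-4
    batch_size: int = 16
    epochs: int = 200

config = ReMoGPTConfig()
print(config.optimizer, config.learning_rate)  # AdamW 0.0001
```

A frozen dataclass like this makes the reported settings explicit and immutable, which is a common pattern when re-implementing a paper whose authors release no code.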