Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs

Authors: Qi Wu, Yubo Zhao, Yifan Wang, Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We assess our Motion-Agent framework with general and complex conversational user inputs, demonstrating its ability to handle intricate, multi-turn interactions. We also evaluate MotionLLM on single-turn motion generation and motion captioning tasks.

4.1 EXPERIMENT SETUP

Datasets. Our experiments on MotionLLM are conducted with the KIT Motion Language Dataset (KIT-ML) (Plappert et al., 2016) and HumanML3D (Guo et al., 2022a). KIT-ML contains 3,911 human motion sequences, while the HumanML3D dataset, obtained from AMASS (Mahmood et al., 2019) and HumanAct12 (Guo et al., 2020), contains 14,616 human motion sequences with 44,970 textual descriptions.

Evaluation Metrics. For motion generation, we follow T2M (Guo et al., 2022a). Global representations of motions and textual descriptions are first extracted with the pre-trained network in (Guo et al., 2022a) and then measured as follows: 1) Text matching: R-precision (Top-1, Top-2, and Top-3 accuracy), computed by ranking Euclidean distances between motion and text embeddings, and MM Dist, which measures the average distance between text and generated motion embeddings. 2) Generation diversity: quantifies the variance of generated motions across all descriptions. 3) Motion fidelity: FID assesses the distance between the distributions of real and generated motions, reflecting how closely the generated motions match the real motion distribution. For motion captioning, we follow TM2T (Guo et al., 2022b) and evaluate caption quality with linguistic metrics from natural language processing, including BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), CIDEr (Vedantam et al., 2015), and BERTScore (Zhang et al., 2020).

4.2 RESULTS OF MOTION-AGENT

In this section, we present the results of our Motion-Agent framework, demonstrating its ability to generate long outputs through complex combinations of tasks via multi-turn conversations.
It is important to note that no established ground truth exists for such tasks aside from text-motion translation, and we do not conduct additional training for these extended tasks.

4.3 EVALUATIONS OF MOTIONLLM

We evaluate MotionLLM on both text-to-motion and motion-to-text tasks to validate that it achieves satisfactory results. MotionLLM focuses on enabling bidirectional translation with minimal training load while still maintaining competitive performance across key benchmarks.

4.4 ABLATION STUDY

Ablation on Motion-Agent. Theoretically, the MotionLLM agent in our Motion-Agent framework can be replaced with any model capable of motion-text translation. However, models like MoMask (Guo et al., 2024), which require the motion length as manual input, may encounter issues (see Sec. A.1.2), making autoregressive models preferable.
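As a reference for the text-matching metrics quoted in Section 4.1 above (R-precision and MM Dist), the following is a minimal sketch of how such retrieval metrics are typically computed over paired text/motion embeddings. Function and variable names are illustrative and not taken from the paper's code.

```python
import numpy as np

def retrieval_metrics(text_emb, motion_emb, top_k=(1, 2, 3)):
    """R-precision and MM Dist over paired (N, D) embedding arrays.

    For each text embedding, all motion embeddings are ranked by Euclidean
    distance; Top-k accuracy counts how often the paired motion (same row
    index) lands in the top k. MM Dist is the mean distance between each
    text and its paired motion embedding.
    """
    # Pairwise Euclidean distances between every text and every motion.
    dists = np.linalg.norm(text_emb[:, None, :] - motion_emb[None, :, :], axis=-1)
    ranks = np.argsort(dists, axis=1)                   # motions sorted per text
    match = ranks == np.arange(len(text_emb))[:, None]  # True at the paired motion
    metrics = {f"top{k}": match[:, :k].any(axis=1).mean() for k in top_k}
    metrics["mm_dist"] = np.diag(dists).mean()          # distance to paired motion
    return metrics
```

In practice the embeddings would come from the pre-trained text/motion encoder of Guo et al. (2022a); the sketch only shows the ranking arithmetic.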
Researcher Affiliation Academia Qi Wu1 , Yubo Zhao1 , Yifan Wang1, Xinhang Liu1, Yu-Wing Tai2, Chi-Keung Tang1 1The Hong Kong University of Science and Technology 2 Dartmouth College
Pseudocode No The paper describes the Motion-Agent pipeline and its components in Section 3 and Figure 2, but does not provide structured pseudocode or algorithm blocks. The processes are explained in natural language.
Open Source Code No Project page: https://knoxzhao.github.io/Motion-Agent. The paper provides a project page URL, but it is not a direct link to a code repository and does not explicitly state that the source code for their methodology is released. It mentions using 'Gemma2-2b-it... a lightweight open-source LLM', but this refers to a third-party tool they utilized, not their own implementation code.
Open Datasets Yes Our experiments on MotionLLM are conducted with the KIT Motion Language Dataset (KIT-ML) (Plappert et al., 2016) and HumanML3D (Guo et al., 2022a).
Dataset Splits No Our experiments on MotionLLM are conducted with the KIT Motion Language Dataset (KIT-ML) (Plappert et al., 2016) and HumanML3D (Guo et al., 2022a). For motion generation, we follow T2M (Guo et al., 2022a). Table 2: Quantitative evaluation of MotionLLM on the HumanML3D (Guo et al., 2022a) test set.
Hardware Specification Yes All of our experiments are conducted on NVIDIA RTX 4090s.
Software Dependencies No The paper mentions specific LLM models like 'GPT-4 (Achiam et al., 2023)', 'Gemma2-2b-it (Team et al., 2024b)', 'Llama (Touvron et al., 2023)', 'Gemma (Team et al., 2024b)', and 'Mixtral (Jiang et al., 2024a)'. However, it does not provide specific version numbers for ancillary software libraries or frameworks like Python, PyTorch, or CUDA.
Experiment Setup Yes In our tokenizer, we set the downsampling rate N to 4, the hidden dimension d to 512, and the codebook size K to 512. The weighting factors α and β for Lp and Lcommit are set to 0.5 and 0.02, respectively. For MotionLLM, we employ Gemma2-2b-it (Team et al., 2024b), a lightweight open-source LLM from Google, which offers accessibility and can be deployed on a single consumer-level GPU. The LoRA rank is set to 64 for generation and 32 for captioning; the values of alpha remain the same as the rank. All of our experiments are conducted on NVIDIA RTX 4090s.

Table 7: Hyper-parameters of our models used in our main experiments.

Hyper-parameter     Motion Generation   Motion Captioning
Batch size          6                   6
Learning rate       1e-5                1e-5
LoRA rank           64                  32
LoRA alpha          32                  32
LoRA dropout        0.1                 0.1
Codebook size       512                 512
Codebook dim        512                 512
Total vocab size    256514              256514
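The LoRA settings in Table 7 amount to a low-rank additive update applied to frozen pretrained weights. Below is a minimal numpy sketch of the adapted forward pass; the shapes and alpha/r scaling convention follow the standard LoRA formulation, and nothing here is taken from the paper's own code.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Forward pass of a LoRA-adapted linear layer: y = x(W + (alpha/r) AB).

    W is the frozen (d_in, d_out) pretrained weight; A (d_in, r) and
    B (r, d_out) are the trainable low-rank factors. The update is scaled
    by alpha / r, so for example rank 64 with alpha 32 (Table 7's motion
    generation setting) scales the adapter output by 0.5.
    """
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A @ B)
```

Only A and B are updated during fine-tuning, which is what keeps the training load small relative to full fine-tuning of the 2B-parameter backbone.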