Scaling Large Motion Models with Million-Level Human Motions

Authors: Ye Wang, Sipeng Zheng, Bin Cao, Qianshan Wei, Weishuai Zeng, Qin Jin, Zongqing Lu

ICML 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | To address this gap, we present MotionLib, the first million-level dataset for motion generation, which is at least 15× larger than existing counterparts and enriched with hierarchical text descriptions. Using MotionLib, we train a large motion model named Being-M0, demonstrating robust performance across a wide range of human activities, including unseen ones. Through systematic investigation, for the first time, we highlight the importance of scaling both data and model size for advancing motion generation, along with key insights to achieve this goal.
Researcher Affiliation | Collaboration | 1Renmin University of China 2Beijing Academy of Artificial Intelligence 3Institute of Automation, Chinese Academy of Sciences 4Southeast University 5Peking University 6BeingBeyond. Correspondence to: Zongqing Lu <EMAIL>.
Pseudocode | No | The paper describes methods and procedures in narrative text and flowcharts, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | For further details, visit https://beingbeyond.github.io/Being-M0/.
Open Datasets | Yes | In this paper, we aim to address the question: Can scaling the large motion model and data benefit motion generation? To tackle this, we develop a systematic data collection pipeline to build MotionLib, the first large-scale dataset containing over 1.2M motion sequences, at least 15× larger than current counterparts. This initiative provides a solid foundation for building robust, universally applicable motion models and offers a comprehensive testbed for future research.
Dataset Splits | Yes | Following standard practice, each dataset is split into training, validation, and test sets in proportions of 85%, 5%, and 15%, respectively.
Hardware Specification | Yes | For training the large motion model, full parameter tuning is performed on 8 A800 GPUs with a batch size of 1024 over 100 epochs.
Software Dependencies | No | The paper mentions models like GPT2-medium, LLaMA2-7b, LLaMA2-13b, and LLaMA3.1-8b, but does not provide specific version numbers for software libraries or dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | For the motion tokenizer, we implement the VQ codebook C ∈ R^{1024×512} with an embedding dimensionality of d = 512. The resulting discrete codes are incorporated as additional vocabulary for the LLM. As a comparison, the LFQ codebook has a size of 2^16 = 16384. The motion encoder E uses a temporal downsampling rate of α = 4. We experiment with four large language model (LLM) architectures to construct our large motion model: GPT2-medium (Radford et al., 2019), LLaMA2-7b, LLaMA2-13b (Touvron et al., 2023), and LLaMA3.1-8b (Dubey et al., 2024). The motion tokenizer is trained with a learning rate of 1e-4 and a batch size of 256 for 300K iterations. For training the large motion model, full parameter tuning is performed on 8 A800 GPUs with a batch size of 1024 over 100 epochs. The learning rate is set to 2e-4 for GPT2-medium and 2e-5 for the LLaMA models.
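To make the quoted setup concrete, below is a minimal sketch of the quantization step such a VQ motion tokenizer performs, assuming only the dimensions given in the excerpt (a 1024-entry codebook with d = 512, temporal downsampling α = 4). The function name `quantize` and the random codebook are illustrative, not from the paper.

```python
import numpy as np

# Hypothetical sketch: nearest-neighbor lookup into a VQ codebook
# C in R^{1024 x 512}, as described in the quoted experiment setup.
rng = np.random.default_rng(0)
CODEBOOK_SIZE, EMBED_DIM = 1024, 512
codebook = rng.standard_normal((CODEBOOK_SIZE, EMBED_DIM))

def quantize(latents: np.ndarray) -> np.ndarray:
    """Map each d-dim latent to the index of its nearest codebook entry.

    latents: (T, d) encoder outputs; with temporal downsampling alpha = 4,
    a motion of N frames yields T = N // 4 latents.
    """
    # Squared Euclidean distance from every latent to every code: (T, 1024).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # Each index is a discrete motion token, usable as extra LLM vocabulary.
    return dists.argmin(axis=1)

# A 64-frame motion -> 64 // 4 = 16 latents -> 16 discrete tokens.
tokens = quantize(rng.standard_normal((64 // 4, EMBED_DIM)))
```

For the LFQ variant mentioned as a comparison, the discrete space would instead have 2^16 = 16384 entries, but the principle of mapping continuous latents to integer tokens is the same.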