How to Move Your Dragon: Text-to-Motion Synthesis for Large-Vocabulary Objects

Authors: Wonkwang Lee, Jongwon Jeong, Taehong Moon, Hyeon-Jong Kim, Jaehyeon Kim, Gunhee Kim, Byeong-Uk Lee

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments show that our method learns to generate high-fidelity motions from textual descriptions for diverse and even unseen objects, setting a strong foundation for motion synthesis across diverse object categories and skeletal templates. Qualitative results are available on this link." ... "Extensive experiments on Truebones Zoo dataset demonstrate our framework's ability to generate high-fidelity motions conditioned on textual descriptions, or even synthesize motions for novel objects downloaded from the web."
Researcher Affiliation | Collaboration | 1 Seoul National University, 2 KRAFTON, 3 NVIDIA. Correspondence to: Byeong-Uk Lee <EMAIL>.
Pseudocode | No | The paper describes methods and processes in text and mathematical formulations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "To inspire future work and further advancements, we will release the code for our data and model pipelines, along with the annotated captions, establishing a comprehensive benchmark for motion synthesis across diverse objects with heterogeneous skeletal structures."
Open Datasets | Yes | "To address the first challenge, we utilize the Truebones Zoo dataset (Truebones, 2022), which contains over 1,000 artist-created animated armature meshes in FBX format, as illustrated in Figure 1."
Dataset Splits | Yes | "To evaluate the pose synthesis model, we aggregate all motion data for each object category and extract their poses. We then apply clustering, generating 30 distinct pose clusters per object. Three clusters are randomly selected as the test pose set, while the remaining clusters are used for training. For motion synthesis evaluation, we randomly select one motion per object for the test set, with the remaining motions used for training."
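The cluster-based split described above can be sketched as follows. This is a minimal illustration under stated assumptions: the paper does not specify the clustering algorithm or the pose feature representation, so the tiny k-means implementation, the function names, and the synthetic pose features are all assumptions.

```python
import numpy as np

def kmeans_labels(x, k, iters=20, seed=0):
    """Minimal k-means; returns one cluster label per row of x (an assumption —
    the paper only says "we apply clustering")."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each pose to its nearest center, then recompute centers.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def split_poses_by_cluster(poses, n_clusters=30, n_test=3, seed=0):
    """Hold out whole clusters as the test pose set, following the described
    protocol: 30 clusters per object, 3 randomly chosen for testing."""
    labels = kmeans_labels(poses, n_clusters, seed=seed)
    rng = np.random.default_rng(seed)
    test_clusters = rng.choice(n_clusters, size=n_test, replace=False)
    test_mask = np.isin(labels, test_clusters)
    return poses[~test_mask], poses[test_mask]

# Hypothetical pose features for one object category (500 poses, 24-dim).
poses = np.random.default_rng(0).normal(size=(500, 24))
train, test = split_poses_by_cluster(poses)
```

Splitting by whole clusters, rather than by individual poses, keeps near-duplicate poses from leaking between the training and test sets.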
Hardware Specification | Yes | "All models were trained on a Linux system equipped with either an NVIDIA RTX A6000 (48GB) or A100 (40GB) GPU."
Software Dependencies | No | The paper mentions using GPT-4o and SigLIP-SO400M-patch14-384 (Zhai et al., 2023) but does not provide specific version numbers for these or other software libraries/environments.
Experiment Setup | Yes | "The pose diffusion model required approximately 29GB of VRAM with a batch size of 512 over 400K iterations, completing training in roughly 30 hours. The motion diffusion model used about 38GB of VRAM with a batch size of 4 and sequence length of 90, trained for 1M iterations over approximately 4 days." ... "For motions containing more than 90 frames, we randomly sample a chunk of frames during training."
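The random chunk sampling mentioned above can be sketched as a data-loading helper. The function name and the behavior for short motions are assumptions; the paper only states that a chunk of frames is randomly sampled from motions longer than 90 frames.

```python
import numpy as np

def sample_chunk(motion, chunk_len=90, rng=None):
    """Randomly crop a contiguous window of chunk_len frames for training.

    motion: (T, D) array of per-frame features. Motions with T <= chunk_len
    are returned unchanged (padding, if any, is assumed to happen elsewhere).
    """
    if rng is None:
        rng = np.random.default_rng()
    T = motion.shape[0]
    if T <= chunk_len:
        return motion
    start = rng.integers(0, T - chunk_len + 1)  # inclusive start range
    return motion[start:start + chunk_len]
```

Cropping a fresh random window each epoch lets long motions contribute many distinct training sequences at the fixed length of 90.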