How to Move Your Dragon: Text-to-Motion Synthesis for Large-Vocabulary Objects
Authors: Wonkwang Lee, Jongwon Jeong, Taehong Moon, Hyeon-Jong Kim, Jaehyeon Kim, Gunhee Kim, Byeong-Uk Lee
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our method learns to generate high-fidelity motions from textual descriptions for diverse and even unseen objects, setting a strong foundation for motion synthesis across diverse object categories and skeletal templates. Qualitative results are available at this link. ... Extensive experiments on the Truebones Zoo dataset demonstrate our framework's ability to generate high-fidelity motions conditioned on textual descriptions, or even synthesize motions for novel objects downloaded from the web. |
| Researcher Affiliation | Collaboration | 1Seoul National University 2KRAFTON 3NVIDIA. Correspondence to: Byeong-Uk Lee <EMAIL>. |
| Pseudocode | No | The paper describes methods and processes in text and mathematical formulations but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | To inspire future work and further advancements, we will release the code for our data and model pipelines, along with the annotated captions, establishing a comprehensive benchmark for motion synthesis across diverse objects with heterogeneous skeletal structures. |
| Open Datasets | Yes | To address the first challenge, we utilize the Truebones Zoo dataset (Truebones, 2022), which contains over 1,000 artist-created animated armature meshes in FBX format, as illustrated in Figure 1. |
| Dataset Splits | Yes | To evaluate the pose synthesis model, we aggregate all motion data for each object category and extract their poses. We then apply clustering, generating 30 distinct pose clusters per object. Three clusters are randomly selected as the test pose set, while the remaining clusters are used for training. For motion synthesis evaluation, we randomly select one motion per object for the test set, with the remaining motions used for training. |
| Hardware Specification | Yes | All models were trained on a Linux system equipped with either an NVIDIA RTX A6000 (48GB) or A100 (40GB) GPU. |
| Software Dependencies | No | The paper mentions using GPT-4o and SigLIP-SO400M-patch14-384 (Zhai et al., 2023) but does not provide specific version numbers for these or other software libraries/environments. |
| Experiment Setup | Yes | The pose diffusion model required approximately 29GB of VRAM with a batch size of 512 over 400K iterations, completing training in roughly 30 hours. The motion diffusion model used about 38GB of VRAM with a batch size of 4 and sequence length of 90, trained for 1M iterations over approximately 4 days. ... For motions containing more than 90 frames, we randomly sample a chunk of frames during training. |
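The pose-level split quoted above (30 clusters per object, 3 held out for testing) can be sketched as follows. This is a minimal NumPy reconstruction under assumptions: the paper's excerpt does not name the clustering algorithm, so a plain k-means stands in, and `split_poses` and its pose-feature layout are illustrative, not from the authors' code.

```python
import numpy as np

def kmeans(poses, k, iters=50, seed=0):
    """Toy k-means over flattened pose features; a stand-in for the
    unspecified clustering method in the paper."""
    rng = np.random.default_rng(seed)
    centers = poses[rng.choice(len(poses), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(poses[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = poses[labels == c].mean(axis=0)
    return labels

def split_poses(poses, n_clusters=30, n_test_clusters=3, seed=0):
    """Cluster an object's poses, then hold out whole clusters as the
    test set (3 of 30, randomly chosen), as described in the paper."""
    labels = kmeans(poses, n_clusters, seed=seed)
    rng = np.random.default_rng(seed)
    test_clusters = rng.choice(n_clusters, size=n_test_clusters, replace=False)
    test_mask = np.isin(labels, test_clusters)
    return poses[~test_mask], poses[test_mask]

# Toy example: 500 poses with 24 joints x 3 coordinates each.
poses = np.random.default_rng(1).normal(size=(500, 24 * 3))
train, test = split_poses(poses)
```

Holding out entire clusters (rather than random poses) keeps near-duplicate poses from leaking between train and test, which is presumably the point of the cluster-based split.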
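The training-time cropping mentioned in the setup row ("for motions containing more than 90 frames, we randomly sample a chunk of frames during training") amounts to taking a random contiguous 90-frame window. A hedged sketch, with the function name and tensor layout assumed rather than taken from the paper's code:

```python
import numpy as np

MAX_FRAMES = 90  # motion diffusion model's training sequence length (from the paper)

def sample_chunk(motion, max_frames=MAX_FRAMES, rng=None):
    """motion: (T, J, D) array of per-frame joint features.
    Returns the motion unchanged if it is short enough, otherwise a
    random contiguous window of max_frames frames."""
    rng = rng if rng is not None else np.random.default_rng()
    T = motion.shape[0]
    if T <= max_frames:
        return motion  # short clips are used whole
    start = rng.integers(0, T - max_frames + 1)
    return motion[start:start + max_frames]

# Toy example: a 240-frame motion with 22 joints in 3D.
motion = np.zeros((240, 22, 3))
chunk = sample_chunk(motion, rng=np.random.default_rng(0))
```

Random windows rather than fixed prefixes expose the model to all parts of long clips while keeping per-batch memory bounded, consistent with the reported 38GB VRAM budget at batch size 4.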