Light-T2M: A Lightweight and Fast Model for Text-to-motion Generation

Authors: Ling-An Zeng, Guohong Huang, Gaojie Wu, Wei-Shi Zheng

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Compared to the state-of-the-art method, MoMask, our Light-T2M model features just 10% of the parameters (4.48M vs 44.85M) and achieves a 16% faster inference time (0.152s vs 0.180s), while surpassing MoMask with an FID of 0.040 (vs. 0.045) on the HumanML3D dataset and 0.161 (vs. 0.228) on the KIT-ML dataset. [...] 4 Experiments [...] 4.3 Comparison with State-of-the-arts [...] 4.4 Ablation Studies
Researcher Affiliation | Academia | Ling-An Zeng (1), Guohong Huang (1), Gaojie Wu (1), Wei-Shi Zheng (1,2)* — (1) Sun Yat-sen University; (2) Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes methods and architectures using figures and text but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code: https://github.com/qinghuannn/light-t2m
Open Datasets | Yes | We conduct experiments on two most common public text-motion datasets, i.e., the HumanML3D dataset (Guo et al. 2022a) and the KIT-ML dataset (Plappert, Mandery, and Asfour 2016).
Dataset Splits | Yes | For both datasets, the preprocessing procedure and the train-test-validation split remain consistent with (Guo et al. 2022a).
Hardware Specification | Yes | Average Inference Time (AIT) is calculated from the average across 100 samples using the same RTX 3090Ti GPU. [...] Our Light-T2M is optimized by AdamW (Loshchilov and Hutter 2019) with a learning rate of 2e-4, a cosine annealing schedule, and a batch size of 256 on 2 RTX 3090Ti GPUs.
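The AIT protocol above (wall-clock time averaged over repeated generation runs) can be sketched as follows. This is a minimal illustration, not the paper's measurement code: `run_model` is a hypothetical stand-in for the actual text-to-motion generator, and on a GPU one would additionally call `torch.cuda.synchronize()` before and after each timed run so queued kernels are included.

```python
# Sketch of Average Inference Time (AIT) measurement: mean wall-clock
# seconds per sample over repeated runs. The workload below is a dummy
# stand-in for the real model's forward/sampling pass.
import time

def average_inference_time(run_model, n_samples=100):
    """Return mean per-sample inference time in seconds."""
    total = 0.0
    for _ in range(n_samples):
        start = time.perf_counter()   # monotonic high-resolution clock
        run_model()
        total += time.perf_counter() - start
    return total / n_samples

dummy = lambda: sum(i * i for i in range(10_000))  # placeholder workload
ait = average_inference_time(dummy, n_samples=20)
print(f"AIT: {ait:.6f}s")
```

Using `time.perf_counter` rather than `time.time` avoids clock-adjustment artifacts in short measurements; averaging over many samples (100 in the paper) smooths out per-run jitter.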
Software Dependencies | No | The paper mentions several models and optimizers like AdamW, CLIP, Mamba, and UniPC, but does not specify software environment dependencies such as Python, PyTorch/TensorFlow, or CUDA versions.
Experiment Setup | Yes | The max diffusion step T is 1000 and the linearly varying variances βt range from 10^-4 to 10^-2. During inference, we adopt UniPC (Zhao et al. 2023) with 10 time steps for the fast sampling. The number of blocks N, the hidden dim D, and the downsampling factor S are 4, 256, and 8, respectively. The guidance scale s and the text dropout ratio τ are set to 4 and 0.2, respectively. Our Light-T2M is optimized by AdamW (Loshchilov and Hutter 2019) with a learning rate of 2e-4, a cosine annealing schedule, and a batch size of 256 on 2 RTX 3090Ti GPUs. Light-T2M is trained with 3000/5000 epochs on the HumanML3D/KIT-ML datasets.
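The reported noise schedule (T = 1000 steps, variances βt varying linearly from 10^-4 to 10^-2) can be sketched as below. This is a pure-Python illustration under the stated hyperparameters, not the paper's implementation; the cumulative product ᾱ_T = Π(1 − βt) is included to show that the schedule drives the signal close to pure noise by the final step.

```python
# Sketch of the reported linear diffusion variance schedule:
# T = 1000 steps, beta_t from 1e-4 to 1e-2 (values from the paper).
T = 1000
beta_start, beta_end = 1e-4, 1e-2

def beta(t):
    """Linear variance schedule for integer t in [0, T-1]."""
    return beta_start + (beta_end - beta_start) * t / (T - 1)

alpha_bar = 1.0
for t in range(T):            # cumulative product of (1 - beta_t)
    alpha_bar *= 1.0 - beta(t)

print(beta(0), beta(T - 1), alpha_bar)
```

The small terminal ᾱ_T (well under 0.01 here) is what lets sampling start from Gaussian noise; at inference the paper replaces the full 1000-step reverse process with the UniPC solver using only 10 time steps.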