Towards Unified Human Motion-Language Understanding via Sparse Interpretable Characterization
Authors: Guangtao Lyu, Chenghao Xu, Jiexi Yan, Muli Yang, Cheng Deng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive analyses and extensive experiments across multiple public datasets demonstrate that our model achieves state-of-the-art performance across various tasks and scenarios. |
| Researcher Affiliation | Academia | 1 School of Electronic Engineering, Xidian University, Xi'an, Shaanxi, China, 2 School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi, China, 3 Institute for Infocomm Research (I2R), A*STAR, Singapore |
| Pseudocode | No | The paper describes methods using equations and natural language, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing code, nor does it provide links to any code repositories. |
| Open Datasets | Yes | To validate the effectiveness of our sparse lexical representations, we conduct experiments on two commonly used public datasets: the HumanML3D (Guo et al., 2022a) dataset and the KIT Motion-Language dataset (Plappert et al., 2016). |
| Dataset Splits | Yes | The HumanML3D dataset extends the AMASS (Mahmood et al., 2019) and HumanAct12 (Guo et al., 2020) motion capture datasets by adding natural language annotations, comprising 23,384 motions for training, 1,460 for validation, and 4,380 for testing. The KIT-ML dataset, focused primarily on locomotion and derived from motion capture data, comprises 4,888 motions for training, 300 for validation, and 830 for testing. |
| Hardware Specification | Yes | We compare the training time of different models on the HumanML3D dataset using a single A6000 GPU and report the results in Table 12. |
| Software Dependencies | No | The paper mentions using pretrained BERT (Devlin, 2018) as a text encoder and a transformer (Vaswani et al., 2017) for the motion encoder, along with the Adam optimizer (Kingma & Ba, 2014). However, it does not specify concrete version numbers for any software libraries or programming languages used (e.g., PyTorch version, Python version). |
| Experiment Setup | Yes | We utilize pretrained BERT (Devlin, 2018) as our text encoder and implement a transformer (Vaswani et al., 2017) with spatial and temporal attention mechanisms for the motion encoder. Our experiments employ the Adam optimizer (Kingma & Ba, 2014), with learning rates set to 10⁻⁵ for the text encoder, 10⁻⁴ for the motion encoder, and 10⁻³ for the Lexical Disentanglement Head and Lexical Bottleneck Masked Decoder. During the LexMLM phase, we train with a batch size of 128 for 50 epochs. In the CMMM phase, we use a batch size of 64 and train for 200 epochs. For the LexMMM phase, we freeze the lexical space and fine-tune the motion encoder to align with the language domain, using a batch size of 64 for 150 epochs. Finally, in the LexCMLP phase, we use a batch size of 64 and train for 20 epochs at a learning rate of 10⁻⁵. |
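The experiment-setup row above lists per-module learning rates and a four-phase training schedule. A minimal sketch of that configuration, transcribed into plain Python, is shown below; the phase and module names follow the paper's terminology, but the dictionary layout and the `total_epochs` helper are illustrative, not the authors' implementation.

```python
# Per-module Adam learning rates as reported in the paper's setup description.
LEARNING_RATES = {
    "text_encoder": 1e-5,                     # pretrained BERT
    "motion_encoder": 1e-4,                   # spatial/temporal transformer
    "lexical_disentanglement_head": 1e-3,
    "lexical_bottleneck_masked_decoder": 1e-3,
}

# The four reported training phases with their batch sizes and epoch counts.
PHASES = [
    {"name": "LexMLM",  "batch_size": 128, "epochs": 50},
    {"name": "CMMM",    "batch_size": 64,  "epochs": 200},
    {"name": "LexMMM",  "batch_size": 64,  "epochs": 150},  # lexical space frozen
    {"name": "LexCMLP", "batch_size": 64,  "epochs": 20, "lr": 1e-5},
]

def total_epochs(phases):
    """Sum the epoch counts across all training phases."""
    return sum(p["epochs"] for p in phases)
```

Under this transcription the full schedule amounts to `total_epochs(PHASES)` = 420 epochs across the four phases.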