Sitcom-Crafter: A Plot-Driven Human Motion Generation System in 3D Scenes

Authors: Jianqi Chen, Panwen Hu, Xiaojun Chang, Zhenwei Shi, Michael Kampffmeyer, Xiaodan Liang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | Experimental evaluations validate the system's ability to generate high-quality, diverse, and physically realistic motions, underscoring its potential for advancing creative workflows. We evaluate the performance of Sitcom-Crafter using open-source 3D scenes. The experimental results demonstrate that Sitcom-Crafter can generate high-quality, diverse, and well-physics-constrained human motions (see examples in Fig. 1). Our key contributions in this work are as follows: 1) We develop a comprehensive human motion generation system, Sitcom-Crafter, which supports the synthesis of diverse types of human motions guided by both 3D scene structures and long plot contexts. The system consists of three motion generation modules and five augmentation modules that provide a flexible approach to motion generation. 2) We introduce a novel self-supervised, scene-aware human-human interaction generation method within the generation modules. By synthesizing binary SDF points around the motion region, we incorporate surrounding scene information into the generator, addressing the motion-scene collision problem prevalent in existing methods. Additionally, we unify motion representation using marker points across the different generation modules, ensuring seamless integration and compatibility of the generated motions. 3) We design five augmentation modules to enhance the cohesiveness and quality of the generated motions and improve the system's user-friendliness. These include modules for plot interpretation and command distribution, motion synchronization, collision revision, hand pose retrieval, and motion retargeting. Experimental evaluations on open-source 3D scenes demonstrate that our system effectively synthesizes high-quality, diverse, and well-physics-constrained human motions.
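The "binary SDF points" idea quoted above (sampling points around the motion region and encoding whether each lies inside scene geometry) can be illustrated with a minimal sketch. The function name, sampling scheme, and cube-shaped region are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def binary_sdf_points(scene_sdf, center, radius=1.0, n=512, rng=None):
    """Sketch of scene-aware conditioning via binarized SDF samples.

    `scene_sdf` is assumed to map an (n, 3) array of points to signed
    distances (negative = inside scene geometry). We sample points in a
    cube around the motion region and keep only the occupancy bit, which
    is the kind of binary scene feature the paper describes feeding to
    the motion generator. All names here are hypothetical.
    """
    rng = np.random.default_rng(rng)
    # Uniform samples in a cube of half-width `radius` around the region center.
    pts = center + rng.uniform(-radius, radius, size=(n, 3))
    # Binarize the signed distance: 1.0 = point is inside scene geometry.
    occ = (scene_sdf(pts) < 0).astype(np.float32)
    return pts, occ

# Usage with a toy SDF (a unit-radius sphere at the origin):
sphere_sdf = lambda p: np.linalg.norm(p, axis=1) - 0.5
points, occupancy = binary_sdf_points(sphere_sdf, np.zeros(3), radius=1.0, n=256, rng=0)
```

In the real system the SDF would come from the input 3D scene mesh rather than an analytic shape; the binarization step is what removes fine distance detail while keeping collision-relevant occupancy information.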
Researcher Affiliation | Academia | Beihang University; The Chinese University of Hong Kong, Shenzhen; University of Technology Sydney; UiT The Arctic University of Norway; Sun Yat-sen University
Pseudocode | No | The paper describes methodologies and system components in detail through text and figures, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | Project page: https://windvchen.github.io/Sitcom-Crafter.
Open Datasets | Yes | Datasets. We trained our human-human interaction generation module primarily using the InterHuman (Liang et al., 2024) dataset, and also incorporated the Inter-X (Xu et al., 2024) dataset specifically for experiments exploring the effects of data scale (see Appendix G) due to the substantial training costs. For evaluating the quality of generated motions, we utilized 11 indoor scenes from the Replica dataset (Straub et al., 2019) as input 3D scenes in our system.
Dataset Splits | Yes | We train two regressors: a marker-to-SMPL regressor and a marker-to-SMPLX regressor. Both regressors are trained on the training and validation sets of the InterHuman dataset (Liang et al., 2024) and tested on the test set.
Hardware Specification | Yes | Training was performed using 4 RTX 4090 GPUs, totaling 384 GPU hours for training on InterGen, and extended to 1000 GPU hours when incorporating training on Inter-X.
Software Dependencies | No | The paper mentions software and models such as Google Gemini 1.5, CLIP, the Keemap package, and Blender, but it does not specify version numbers for any of these components. For example, it cites 'Google Gemini 1.5 (Reid et al., 2024)' without a model or API version, and gives no Python or PyTorch versions.
Experiment Setup | Yes | The loss weights for each function in Eq. 3 were determined through hyperparameter tuning to achieve optimal performance, resulting in values of 1, 3, 0.001, 30, 30, 0.01, 3, 1, 1, respectively. We conducted training over 1,000 epochs in Phase 1 with a batch size of 20 per GPU and a learning rate of 1e-4. In Phase 2, we trained for 500 epochs using the same batch size but reduced the learning rate to 1e-5. For Phase 3, we decreased the batch size to 6 per GPU due to increased memory costs from the human penetration loss and further lowered the learning rate to 1e-6.
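The quoted three-phase schedule can be summarized in a small config sketch. The class and field names are illustrative (not from the paper's code), and the Phase 3 epoch count is left as None because the paper does not state it:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PhaseConfig:
    """One training phase, as quoted in the paper's experiment setup."""
    epochs: Optional[int]          # None where the paper does not report it
    batch_size_per_gpu: int
    learning_rate: float
    uses_penetration_loss: bool = False  # Phase 3 adds the memory-heavy penetration loss

# Loss weights for Eq. 3, in the order listed in the paper.
EQ3_LOSS_WEIGHTS = [1, 3, 0.001, 30, 30, 0.01, 3, 1, 1]

PHASES = {
    1: PhaseConfig(epochs=1000, batch_size_per_gpu=20, learning_rate=1e-4),
    2: PhaseConfig(epochs=500, batch_size_per_gpu=20, learning_rate=1e-5),
    3: PhaseConfig(epochs=None, batch_size_per_gpu=6, learning_rate=1e-6,
                   uses_penetration_loss=True),
}
```

Laid out this way, the schedule shows a standard coarse-to-fine pattern: each phase lowers the learning rate by an order of magnitude, and only the final phase pays the memory cost of the penetration loss (hence the smaller batch).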