Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration
Authors: Yanbo Ding, Shaobin Zhuang, Kunchang Li, Zhengrong Yue, Yu Qiao, Yali Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show the state-of-the-art performance of MUSES on both T2I-CompBench and T2I-3DisBench, outperforming recent strong competitors such as DALL-E 3 and Stable Diffusion 3. These results demonstrate a significant step forward for MUSES in bridging natural language, 2D image generation, and the 3D world. |
| Researcher Affiliation | Academia | 1Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences 2School of Artificial Intelligence, University of Chinese Academy of Sciences 3Shanghai Artificial Intelligence Laboratory 4Shanghai Jiao Tong University EMAIL |
| Pseudocode | No | The paper describes the workflow of MUSES with three key components (Layout Manager, Model Engineer, Image Artist) and details their functions in paragraph text and figures, but it does not present any formal pseudocode blocks or algorithms. |
| Open Source Code | Yes | Code: https://github.com/DINGYANB/MUSES |
| Open Datasets | Yes | Dataset: https://huggingface.co/yanboding/MUSES | Extended version: https://arxiv.org/abs/2408.10605 |
| Dataset Splits | Yes | To balance positive and negative samples during fine-tuning, we duplicate the face-camera image 5 times for each 3D model. This yields a training set of 1500 image-text pairs, which are used to fine-tune CLIP as a face-camera classifier via contrastive language-image learning. After fine-tuning, we test it on an additional 1500 images (50% face-camera, 50% not) from another 150 3D models. All test images are correctly classified. |
| Hardware Specification | Yes | Experiments are conducted on 8 NVIDIA RTX 3090 GPUs. |
| Software Dependencies | Yes | In our experimental setup, we employed Llama-3-8B (AI@Meta 2024) for 3D layout planning, ViT-L/14 for image/text encoding, ViT-B/32 for orientation calibration, and SD 3 ControlNet (Zhang, Rao, and Agrawala 2023) for controllable image generation... We use Mini-InternVL 1.5 (Chen et al. 2024a) for automated evaluation on T2I-3DisBench. |
| Experiment Setup | Yes | For Llama-3-8B, we set top-p to 0.1 and temperature to 0.2 to ensure precise, consistent, and reliable outputs. For SD 3 ControlNet, we set inference steps to 20 and control scales from 0.5 to 0.9, as discussed in the parameter ablation in the Extended Version. |
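The quoted experiment setup can be collected into a single reproducibility config. This is a minimal sketch, not the authors' code: the variable and key names (`llm_config`, `controlnet_config`, `control_scale_range`) are illustrative assumptions; only the numeric values come from the table above.

```python
# Hypothetical reproducibility config assembled from the quoted setup.
# Only the numbers (top-p, temperature, steps, control scales) are from the
# paper's stated settings; all names here are illustrative.
llm_config = {
    "model": "Llama-3-8B",       # used for 3D layout planning
    "top_p": 0.1,                # low top-p for precise, consistent outputs
    "temperature": 0.2,          # low temperature for reliable planning
}

controlnet_config = {
    "base_model": "SD 3 ControlNet",   # controllable image generation
    "num_inference_steps": 20,
    "control_scale_range": (0.5, 0.9), # ablated in the Extended Version
}

def control_scales(n, lo=0.5, hi=0.9):
    """Evenly spaced control scales inside the quoted ablation range."""
    if n == 1:
        return [lo]
    step = (hi - lo) / (n - 1)
    return [round(lo + i * step, 2) for i in range(n)]
```

For example, `control_scales(5)` returns `[0.5, 0.6, 0.7, 0.8, 0.9]`, a plausible sweep over the stated 0.5-0.9 range.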