Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration
Authors: Yanbo Ding, Shaobin Zhuang, Kunchang Li, Zhengrong Yue, Yu Qiao, Yali Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show the state-of-the-art performance of MUSES on both T2I-CompBench and T2I-3DisBench, outperforming recent strong competitors such as DALL-E 3 and Stable Diffusion 3. These results demonstrate a significant step forward for MUSES in bridging natural language, 2D image generation, and the 3D world. |
| Researcher Affiliation | Academia | 1Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences 2School of Artificial Intelligence, University of Chinese Academy of Sciences 3Shanghai Artificial Intelligence Laboratory 4Shanghai Jiao Tong University EMAIL |
| Pseudocode | No | The paper describes the workflow of MUSES with three key components (Layout Manager, Model Engineer, Image Artist) and details their functions in paragraph text and figures, but it does not present any formal pseudocode blocks or algorithms. |
| Open Source Code | Yes | Code: https://github.com/DINGYANB/MUSES |
| Open Datasets | Yes | Dataset: https://huggingface.co/yanboding/MUSES | Extended version: https://arxiv.org/abs/2408.10605 |
| Dataset Splits | Yes | To balance positive and negative samples during fine-tuning, we duplicate the face-camera image 5 times for each 3D model. This yields a training set of 1500 image-text pairs, which are used to fine-tune CLIP as a face-camera classifier via contrastive language-image learning. After fine-tuning, we test it on an additional 1500 images (50% face-camera, 50% not) from another 150 3D models. All test images are correctly classified. |
| Hardware Specification | Yes | Experiments are conducted on 8 NVIDIA RTX 3090 GPUs. |
| Software Dependencies | Yes | In our experimental setup, we employed Llama-3-8B (AI@Meta 2024) for 3D layout planning, ViT-L/14 for image/text encoding, ViT-B/32 for orientation calibration, and SD 3 ControlNet (Zhang, Rao, and Agrawala 2023) for controllable image generation... We use Mini-InternVL 1.5 (Chen et al. 2024a) for automated evaluation on T2I-3DisBench. |
| Experiment Setup | Yes | For Llama-3-8B, we set top-p to 0.1 and temperature to 0.2 to ensure precise, consistent, and reliable outputs. For SD 3 ControlNet, we set inference steps to 20 and control scales from 0.5 to 0.9, as discussed in the parameter ablation in the Extended Version. |
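The quoted experiment setup can be collected into a single reproducibility config. This is a minimal sketch, not the authors' code: the variable and key names (`llm_config`, `controlnet_config`, `control_scale_range`) are illustrative assumptions; only the numeric values come from the table above.

```python
# Hypothetical reproducibility config assembled from the quoted setup.
# Only the numbers (top-p, temperature, steps, control scales) are from the
# paper's stated settings; all names here are illustrative.
llm_config = {
    "model": "Llama-3-8B",       # used for 3D layout planning
    "top_p": 0.1,                # low top-p for precise, consistent outputs
    "temperature": 0.2,          # low temperature for reliable planning
}

controlnet_config = {
    "base_model": "SD 3 ControlNet",   # controllable image generation
    "num_inference_steps": 20,
    "control_scale_range": (0.5, 0.9), # ablated in the Extended Version
}

def control_scales(n, lo=0.5, hi=0.9):
    """Evenly spaced control scales inside the quoted ablation range."""
    if n == 1:
        return [lo]
    step = (hi - lo) / (n - 1)
    return [round(lo + i * step, 2) for i in range(n)]
```

For example, `control_scales(5)` returns `[0.5, 0.6, 0.7, 0.8, 0.9]`, a plausible sweep over the stated 0.5-0.9 range.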