MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance
Authors: Xierui Wang, Siming Fu, Qihan Huang, Wanggui He, Hao Jiang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive quantitative and qualitative experiments affirm that this method surpasses existing models in both image and text fidelity, promoting the development of personalized text-to-image generation. The experimental results demonstrate our method consistently outperforms the state-of-the-art approaches on all the benchmarks. We conclude the previous P-T2I works and provide an overall comparison in Table 1. The paper further delineates comprehensive ablation studies, underpinning the rationale behind our design decisions and affirming the efficacy of our proposed approach. For training, we utilize an in-house video dataset that contains 3.6M video clips. For evaluation, we measure the single-subject and multi-subject performance on Dream Bench (Ruiz et al., 2023) and MS-Bench, respectively. |
| Researcher Affiliation | Collaboration | Xierui Wang²⁺, Siming Fu²⁺, Qihan Huang², Wanggui He¹, Hao Jiang¹*. ¹Alibaba Group; ²Zhejiang University. ⁺Equal contribution; *Corresponding author |
| Pseudocode | No | The paper describes methods like Grounding Resampler and Multi-subject Cross-attention using mathematical formulations and textual descriptions, but it does not include any clearly labeled pseudocode blocks or algorithms. |
| Open Source Code | No | The project page is https://MS-Diffusion.github.io. This is a project page, not a direct link to a code repository. The paper does not explicitly state that the code is released or provide a specific code repository link. |
| Open Datasets | Yes | For evaluation, we measure the single-subject and multi-subject performance on Dream Bench (Ruiz et al., 2023) and MS-Bench, respectively. |
| Dataset Splits | No | The paper trains on an in-house video dataset (3.6M video clips) and evaluates on Dream Bench and MS-Bench. It describes MS-Bench as containing '1148 combinations and 4488 evaluation samples', but it does not provide explicit training/validation/test splits (percentages or exact counts) for any of these datasets. Dream Bench is a known benchmark, but its splits are not detailed in this paper, and for MS-Bench only the evaluation sample count is given, which is insufficient to reproduce a full split. |
| Hardware Specification | Yes | Implemented by Pytorch 2.0.1 and Diffusers 0.23.1, our model is trained on 16 A100 GPUs for 120k steps with a batch size of 8 and a learning rate of 1e-4. |
| Software Dependencies | Yes | Implemented by Pytorch 2.0.1 and Diffusers 0.23.1, our model is trained on 16 A100 GPUs for 120k steps with a batch size of 8 and a learning rate of 1e-4. |
| Experiment Setup | Yes | The pre-trained model employed in MS-Diffusion is Stable Diffusion XL (SDXL) (Podell et al., 2023). Implemented by Pytorch 2.0.1 and Diffusers 0.23.1, our model is trained on 16 A100 GPUs for 120k steps with a batch size of 8 and a learning rate of 1e-4. Following the training of IP-adapter (Ye et al., 2023), we set γ = 1.0 in cross-attention layers and dropped the text and image condition using the same probability. To ensure the model is not dependent on the grounding tokens (Section 3.4), we also randomly drop them with a probability of 0.1. We generate five images for each sample during the inference, with unconditional guidance scale and γ set to 7.5 and 0.6, respectively, to get better results. |
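The training and inference settings quoted above can be collected into a small configuration sketch. This is an illustrative reconstruction, not the authors' code: the class and function names are hypothetical, and the text/image condition drop probability is not stated in the excerpt (only the grounding-token drop probability of 0.1 is), so its default below is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class MSDiffusionConfig:
    """Hyperparameters quoted from the paper's experiment setup.

    Names are illustrative; only the values marked as quoted come
    from the excerpt above.
    """
    train_steps: int = 120_000        # quoted: 120k steps
    batch_size: int = 8               # quoted
    learning_rate: float = 1e-4       # quoted
    gamma_train: float = 1.0          # quoted: gamma in cross-attention layers
    cond_drop_prob: float = 0.1       # ASSUMPTION: paper only says text/image share one probability
    grounding_drop_prob: float = 0.1  # quoted: grounding tokens dropped with p=0.1
    num_inference_images: int = 5     # quoted: five images per sample
    guidance_scale: float = 7.5       # quoted: unconditional guidance scale
    gamma_inference: float = 0.6      # quoted: gamma at inference


def drop_conditions(cfg: MSDiffusionConfig, rng: random.Random):
    """Sample condition-dropout flags for one training step.

    Following the IP-Adapter recipe cited in the paper, text and image
    conditions are dropped with the same probability (interpreted here
    as a single joint draw, which is an assumption); grounding tokens
    are dropped independently so the model does not depend on them.
    """
    drop_text_image = rng.random() < cfg.cond_drop_prob
    drop_grounding = rng.random() < cfg.grounding_drop_prob
    return drop_text_image, drop_grounding
```

Keeping these values in one dataclass makes the quoted setup easy to check against a reimplementation, though whether text and image conditions are dropped jointly or independently cannot be resolved from the excerpt alone.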