DisEnvisioner: Disentangled and Enriched Visual Prompt for Customized Image Generation
Authors: Jing He, Haodong Li, Yongzhe Hu, Guibao Shen, Yingjie Cai, Weichao Qiu, Yingcong Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate the superiority of our approach over existing methods in instruction response (editability), ID consistency, inference speed, and the overall image quality, highlighting the effectiveness and efficiency of DisEnvisioner. Comprehensive experiments validate DisEnvisioner's superiority in adhering to instructions, maintaining ID consistency, and inference speed, demonstrating its superior personalization capabilities and efficiency. |
| Researcher Affiliation | Collaboration | 1HKUST(GZ) 2HKUST 3Noah's Ark Lab |
| Pseudocode | No | The paper describes the methodology using textual explanations and architectural diagrams (Figure 3 and 4) but does not include any specific pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We use the training set of Open Images V6 (Kuznetsova et al., 2020) to train the DisEnvisioner. It contains about 14.61M annotated boxes across 1.74M images. The evaluation is carried out on the DreamBooth (Ruiz et al., 2023) dataset, which comprises 30 subjects and 158 images in total (4–6 images per subject). |
| Dataset Splits | No | The paper states that the training set of Open Images V6 is used for training and the DreamBooth dataset for evaluation, specifying the number of images and prompts used. However, it does not provide explicit details on how these datasets are split into training, validation, and test sets with specific percentages or counts for reproducibility of the data partitioning. |
| Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA A800 GPUs using the AdamW optimizer (Loshchilov & Hutter, 2017) with a weight decay of 0.01. |
| Software Dependencies | Yes | DisEnvisioner is built upon Stable Diffusion v1.5, employing the OpenCLIP ViT-H/14 model as the image/text encoder. ... All experiments are conducted on 8 NVIDIA A800 GPUs using the AdamW optimizer (Loshchilov & Hutter, 2017) with a weight decay of 0.01. ... During inference, in addition, we use the DDIM sampler (Song et al., 2020) with 50 steps and the scale of classifier-free guidance is set to 5.0. |
| Experiment Setup | Yes | During training, DisVisioner is configured with a batch size of 160 and a learning rate of 5e-7 at a resolution of 256. The number of training steps is 120K. We set the token numbers ns = 1 and ni = 1 for the subject-essential and subject-irrelevant features, respectively. The EnVisioner employs a batch size of 40 and a learning rate of 1e-4 at a resolution of 512. The number of training steps is also 120K. The enriched token numbers are n′s = 4 and n′i = 4, with attention scales λs = 1.0 and λi = 1.0. All experiments are conducted on 8 NVIDIA A800 GPUs using the AdamW optimizer (Loshchilov & Hutter, 2017) with a weight decay of 0.01. To enable classifier-free guidance (Dhariwal & Nichol, 2021), we use a probability of 0.05 to drop the condition, both textual and visual. During inference, in addition, we use the DDIM sampler (Song et al., 2020) with 50 steps and the scale of classifier-free guidance is set to 5.0. |
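The hyperparameters quoted above can be collected into a single configuration summary. The following is a plain-Python reference sketch, not the authors' training script; all values are transcribed from the experiment-setup cell, and the key names are illustrative:

```python
# Reference summary of the DisEnvisioner experiment setup, transcribed from
# the paper's reported hyperparameters. Key names are illustrative only.
CONFIG = {
    "disvisioner": {
        "batch_size": 160,
        "learning_rate": 5e-7,
        "resolution": 256,
        "training_steps": 120_000,
        # ns / ni: subject-essential and subject-irrelevant token counts
        "num_tokens": {"subject_essential": 1, "subject_irrelevant": 1},
    },
    "envisioner": {
        "batch_size": 40,
        "learning_rate": 1e-4,
        "resolution": 512,
        "training_steps": 120_000,
        # n's / n'i: enriched token counts
        "num_enriched_tokens": {"subject_essential": 4, "subject_irrelevant": 4},
        # lambda_s / lambda_i: attention scales
        "attention_scale": {"lambda_s": 1.0, "lambda_i": 1.0},
    },
    "shared": {
        "base_model": "Stable Diffusion v1.5",
        "image_text_encoder": "OpenCLIP ViT-H/14",
        "optimizer": "AdamW",
        "weight_decay": 0.01,
        "condition_drop_prob": 0.05,  # enables classifier-free guidance
        "hardware": "8x NVIDIA A800",
    },
    "inference": {
        "sampler": "DDIM",
        "steps": 50,
        "cfg_scale": 5.0,
    },
}

if __name__ == "__main__":
    for stage, params in CONFIG.items():
        print(stage, params)
```

Such a summary makes it easy to spot what a re-implementation would still need (e.g., the dataset split and the exact training schedule, which the paper does not specify).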