DisEnvisioner: Disentangled and Enriched Visual Prompt for Customized Image Generation

Authors: Jing He, Haodong Li, Yongzhe Hu, Guibao Shen, Yingjie Cai, Weichao Qiu, Yingcong Chen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate the superiority of our approach over existing methods in instruction response (editability), ID consistency, inference speed, and overall image quality, highlighting the effectiveness and efficiency of DisEnvisioner. Comprehensive experiments validate DisEnvisioner's superiority in adhering to instructions, maintaining ID consistency, and inference speed, demonstrating its superior personalization capabilities and efficiency.
Researcher Affiliation | Collaboration | 1 HKUST(GZ); 2 HKUST; 3 Noah's Ark Lab
Pseudocode | No | The paper describes the methodology using textual explanations and architectural diagrams (Figures 3 and 4) but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository.
Open Datasets | Yes | We use the training set of Open Images V6 (Kuznetsova et al., 2020) to train the DisEnvisioner. It contains about 14.61M annotated boxes across 1.74M images. The evaluation is carried out on the DreamBooth (Ruiz et al., 2023) dataset, which comprises 30 subjects and 158 images in total (4–6 images per subject).
Dataset Splits | No | The paper states that the training set of Open Images V6 is used for training and the DreamBooth dataset for evaluation, specifying the number of images and prompts used. However, it does not specify how these datasets are partitioned into training, validation, and test splits (no percentages or counts), so the data partitioning cannot be reproduced.
Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA A800 GPUs using the AdamW optimizer (Loshchilov & Hutter, 2017) with a weight decay of 0.01.
Software Dependencies | Yes | DisEnvisioner is built upon Stable Diffusion v1.5, employing the OpenCLIP ViT-H/14 model as the image/text encoder. ... All experiments are conducted on 8 NVIDIA A800 GPUs using the AdamW optimizer (Loshchilov & Hutter, 2017) with a weight decay of 0.01. ... During inference, we use the DDIM sampler (Song et al., 2020) with 50 steps, and the classifier-free guidance scale is set to 5.0.
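The quoted inference settings (DDIM sampling with 50 steps, classifier-free guidance scale 5.0) rely on the standard CFG combination rule, which can be sketched as below. This is a minimal NumPy illustration of that per-step guidance rule only; the function name and array shapes are assumptions, not code from the paper.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, scale=5.0):
    """Combine unconditional and conditional noise predictions.

    At each DDIM step the sampler uses
        eps = eps_uncond + scale * (eps_cond - eps_uncond),
    so scale = 1.0 recovers the purely conditional prediction and
    larger scales push the sample toward the condition.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy scalar "noise predictions" to show the arithmetic:
eps_u = np.array([0.2])
eps_c = np.array([0.4])
print(classifier_free_guidance(eps_u, eps_c, scale=5.0))  # [1.2]
```

With scale 5.0, the guided prediction overshoots the conditional one, which is why higher guidance trades diversity for stronger adherence to the prompt.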
Experiment Setup | Yes | During training, the DisVisioner is configured with a batch size of 160 and a learning rate of 5e-7 at a resolution of 256, trained for 120K steps. We set the token numbers ns = 1 and ni = 1 for the subject-essential and subject-irrelevant features, respectively. The EnVisioner employs a batch size of 40 and a learning rate of 1e-4 at a resolution of 512, also trained for 120K steps. The enriched token numbers are n′s = 4 and n′i = 4, with attention scales λs = 1.0 and λi = 1.0. All experiments are conducted on 8 NVIDIA A800 GPUs using the AdamW optimizer (Loshchilov & Hutter, 2017) with a weight decay of 0.01. To enable classifier-free guidance (Dhariwal & Nichol, 2021), we use a probability of 0.05 to drop both the textual and visual conditions. During inference, we use the DDIM sampler (Song et al., 2020) with 50 steps, and the classifier-free guidance scale is set to 5.0.
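The quoted 0.05 condition-drop probability for enabling classifier-free guidance amounts to a per-sample Bernoulli choice during training: with probability 0.05, both the textual and visual conditions are replaced by null embeddings. The helper below is a hypothetical sketch of that step; the function name and the null-embedding arguments are illustrative, not the authors' code.

```python
import random

# Hyperparameter quoted from the paper's setup.
DROP_PROB = 0.05  # probability of dropping both text and image conditions

def maybe_drop_condition(text_cond, image_cond, null_text, null_image, rng=random):
    """With probability DROP_PROB, replace both conditions with their
    'null' embeddings so the model also learns an unconditional branch,
    which classifier-free guidance needs at inference time."""
    if rng.random() < DROP_PROB:
        return null_text, null_image
    return text_cond, image_cond

# Usage: over many samples, roughly 5% are trained unconditionally.
rng = random.Random(0)
drops = sum(
    maybe_drop_condition("t", "i", None, None, rng) == (None, None)
    for _ in range(10_000)
)
print(drops / 10_000)  # close to 0.05
```

Dropping both modalities together (rather than independently) matches the paper's wording that one probability governs "both textual and visual" conditions.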