EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM

Authors: Zhuofan Zong, Dongzhi Jiang, Bingqi Ma, Guanglu Song, Hao Shao, Dazhong Shen, Yu Liu, Hongsheng Li

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate EasyRef surpasses both tuning-free and tuning-based methods, achieving superior aesthetic quality and robust zero-shot generalization across diverse domains.
Researcher Affiliation | Collaboration | 1 CUHK MMLab, 2 SenseTime Research, 3 Nanjing University of Aeronautics and Astronautics, 4 Shanghai AI Laboratory, 5 CPII under InnoHK. Correspondence to: Hongsheng Li <EMAIL>.
Pseudocode | No | The paper describes its methods using mathematical equations and textual explanations, but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper contains no explicit statement from the authors about releasing their code, nor does it provide a direct link to a code repository for the described methodology.
Open Datasets | Yes | Our data source encompasses two parts: (1) We collect images from several large-scale publicly available datasets, including LAION-2B (Schuhmann et al., 2022), COYO-700M (Byeon et al., 2022), and DataComp-1B (Gadre et al., 2024). (2) We also constructed a tag list that includes celebrity names, character names, styles, and subjects, and collected filtered images from diverse sources based on this list. ... We use the checkpoint trained by single-reference finetuning. As shown in Table 2, EasyRef consistently outperforms other methods in both CLIP-T and DINO-I metrics, demonstrating superior alignment performance. For instance, our model significantly surpasses IP-Adapter-SDXL by 0.223 DINO-I score. Note that since IP-Adapter utilizes CLIP image embeddings for conditioning, its generated images may exhibit a bias towards CLIP's preference, potentially increasing scores when evaluated using CLIP-based metrics. We further conduct qualitative comparisons using some reference images that encompass various consistent elements. As presented in Figure 6, our method achieves better aesthetic quality and consistency with the original image prompts. ... Then we present the single-entity subject-driven generation performance comparisons on DreamBench (Ruiz et al., 2023).
Dataset Splits | Yes | The collected image-text pairs are divided into the training dataset, the held-in evaluation set, and the held-out evaluation set. We first sample 50 image groups to construct the held-out evaluation set to evaluate the model performance on unseen data. The number of images in each set varies. There are a total of 300 images in the held-out evaluation set. ... we randomly selected 300 groups from the remaining 994,215 groups to form a test set of 1434 samples. ... All reference images of the held-in split and other 993,915 groups construct the training set. ... There are 2,117,435 valid training target images in the training set, with an average of 2.1 images per group.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU models, memory specifications) used for running the experiments.
Software Dependencies | No | The paper mentions using specific models such as "Stable Diffusion XL (Podell et al., 2023)" and "Qwen2-VL-2B (Wang et al., 2024d)", and other components such as "CLIP ViT-L/14", "DINOv2-Small", "Co-DETR", "ArcFace", and a "DDIM sampler". However, it does not specify version numbers for these software components or the underlying programming languages/libraries (e.g., PyTorch, CUDA versions) used for their implementation.
Experiment Setup | Yes | The model is pretrained for 300k iterations. We center crop 1024×1024 pixels of the input image. ... we only train the model for 80k iterations. ... We introduce 64 reference tokens in the MLLM. We also employ a drop probability of 0.1 for both text and image prompts independently, and a joint drop probability of 0.1 for simultaneous removal of both modalities. ... During inference, we leverage a DDIM (Song et al., 2020) sampler with 30 steps and a guidance scale (Ho & Salimans, 2022) of 7.5. For the implementation of LoRA comparison, we fine-tuned the model using the reference images and employed a LoRA rank of 32.
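The condition-dropout scheme quoted above (independent 0.1 drop probabilities for text and image prompts, plus a 0.1 joint drop) can be sketched as follows. This is a minimal illustration of the stated probabilities, not the authors' implementation; the function name `sample_drop_mask` and the order in which the joint and independent drops are applied are assumptions.

```python
import random

def sample_drop_mask(p_text=0.1, p_image=0.1, p_joint=0.1, rng=random):
    """Decide which conditions to keep for one training sample.

    First apply the joint drop (both modalities removed together with
    probability p_joint), then drop text and image independently with
    probabilities p_text and p_image. Returns (keep_text, keep_image).
    """
    if rng.random() < p_joint:  # joint drop: remove both modalities at once
        return False, False
    keep_text = rng.random() >= p_text    # independent text-prompt drop
    keep_image = rng.random() >= p_image  # independent image-prompt drop
    return keep_text, keep_image

# Under these assumptions, both conditions survive with probability
# (1 - 0.1) * (1 - 0.1) * (1 - 0.1) = 0.729, and both are dropped with
# probability 0.1 + 0.9 * 0.1 * 0.1 = 0.109.
```

Dropping conditions this way during training is what enables classifier-free guidance at inference time (the guidance scale of 7.5 quoted above), since the model learns both conditional and unconditional denoising.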