CustomContrast: A Multilevel Contrastive Perspective for Subject-Driven Text-to-Image Customization

Authors: Nan Chen, Mengqi Huang, Zhuowei Chen, Yang Zheng, Lei Zhang, Zhendong Mao

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments show the effectiveness of CustomContrast in subject similarity and text controllability. ... Our model, trained on SD-V1.5 and SDXL, outperforms corresponding advanced methods. Experiments show our model improves text controllability by 3.8% and 5.4% respectively, and subject similarity (E-DI) by 5.9% and 2.4%, while easily extending to multi-subject and human domain generation. ... To demonstrate the effectiveness of essential components of CustomContrast, we conduct extensive ablation experiments.
Researcher Affiliation Academia University of Science and Technology of China EMAIL, EMAIL
Pseudocode No The paper describes the methodology using prose, equations, and diagrams, but it does not include explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes https://cn-makers.github.io/CustomContrast/
Open Datasets Yes Filtered subsets of MVImgNet (Yu et al. 2023) and Open Images (Kuznetsova et al. 2020) are used as training sets.
Dataset Splits No The paper states that MVImgNet and Open Images are used as "training sets" but does not provide specific details on how the datasets were split into training, validation, or test sets, nor does it mention percentages or sample counts for these splits.
Hardware Specification Yes The model is trained on 6 A100 GPUs for 200k iterations with learning rate 3e-5, λ1 = 1e-2, and λ2 = 1e-3.
Software Dependencies No The paper mentions using SD-V1.5 and SDXL as base models but does not specify any ancillary software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup Yes The model is trained on 6 A100 GPUs for 200k iterations with learning rate 3e-5, λ1 = 1e-2, and λ2 = 1e-3. The layer numbers of the Textual and Visual Q-Former are set to 4.
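The reported hyperparameters can be collected into a small configuration sketch for anyone attempting a reproduction. This is a hypothetical layout, not a file published by the authors; key names are illustrative, and the Greek symbol for the two weight terms is assumed to be λ (the character was garbled in extraction).

```python
# Hypothetical reproduction config assembled from the figures reported in the
# paper; key names are illustrative, not from any released code.
CONFIG = {
    "base_models": ["SD-V1.5", "SDXL"],  # the two backbones the paper trains on
    "num_gpus": 6,                       # A100 GPUs
    "iterations": 200_000,               # "200k iterations"
    "learning_rate": 3e-5,
    "lambda_1": 1e-2,                    # weight term; λ symbol assumed
    "lambda_2": 1e-3,                    # weight term; λ symbol assumed
    "qformer_layers": 4,                 # Textual and Visual Q-Former depth
}

if __name__ == "__main__":
    for key, value in CONFIG.items():
        print(f"{key}: {value}")
```

A reproduction would still need the unreported details (optimizer, batch size, library versions) that the "Software Dependencies" row above flags as missing.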