CoCoIns: Consistent Subject Generation via Contrastive Instantiated Concepts

Authors: Lee Hsin-Ying, Kelvin C.K. Chan, Ming-Hsuan Yang

TMLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive evaluations on human faces with a single subject show that CoCoIns performs comparably to existing methods while maintaining greater flexibility. We also demonstrate the potential for extending CoCoIns to multiple subjects and other object categories. The source code and model weights are available at https://contrastive-concept-instantiation.github.io. ... We conduct experiments on human images and perform systematic evaluations, including generating portrait photographs and free-form images. We achieve favorable subject consistency and prompt fidelity compared to batch-generation approaches. We also demonstrate early success in extending our approach to multi-subject and general concepts. ... Table 1 shows the quantitative performance on Portraits, and Figure 3 displays two subjects, a man and a woman, each with four images, generated by all approaches.
Researcher Affiliation Collaboration Lee Hsin-Ying (EMAIL), University of California, Merced; Kelvin C.K. Chan (EMAIL), Google DeepMind; Ming-Hsuan Yang (EMAIL), University of California, Merced
Pseudocode No The paper describes the methodology in narrative text and mathematical equations, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes The source code and model weights are available at https://contrastive-concept-instantiation.github.io.
Open Datasets Yes We train the mapping network using the CelebA dataset (Liu et al., 2015), which comprises 20K images and 10K identities. ... The main experiments are conducted on CelebA (Liu et al., 2015). It is made available for noncommercial research purposes and requires users to comply with the terms outlined in the official usage agreement. ... We train the model with animal (Choi et al., 2020) and car (Yu et al., 2015) images.
Dataset Splits No The paper mentions using the CelebA dataset for training and defines specific evaluation prompt sets (Portraits, Scenes), but it does not provide explicit training/validation/test splits for the CelebA dataset itself. It states, "We train the mapping network using the CelebA dataset (Liu et al., 2015), which comprises 20K images and 10K identities," and describes how the evaluation sets (Portraits and Scenes) are constructed for testing, but not how the CelebA training data was partitioned for the training process itself.
Hardware Specification Yes The experiments are conducted on an AMD EPYC 9354 CPU and four NVIDIA A6000 GPUs.
Software Dependencies No The paper mentions using "Stable Diffusion XL (Podell et al., 2023)", "LLaVA-Next (Liu et al., 2023a)", and "Grounded SAM 2 (Kirillov et al., 2023; Liu et al., 2023b; Ren et al., 2024)" as models/tools. However, it does not specify version numbers for these or other underlying software libraries (e.g., Python, PyTorch, TensorFlow versions) that would be needed to reproduce the experiments.
Experiment Setup Yes Table 7. Hyperparameters:
c: 256
λcon: 1
λback: 30
γ: 0.00001
β: 2
n: 8
K: 5000
Batch Size: 128
Learning Rate: 0.0001
Learning Rate Decay: Cosine
Learning Rate Warmup: 500
Optimizer: Adam
Weight Decay: 0.2
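To make the reported settings concrete, the hyperparameters from Table 7 can be collected into a single configuration object. This is a minimal sketch, not the authors' code: the field names (e.g. `concept_dim`, `contrastive_weight`, `background_weight`) are illustrative guesses at what c, λcon, and λback denote, and only the values come from the paper.

```python
from dataclasses import dataclass


@dataclass
class CoCoInsConfig:
    """Hyperparameters reported in Table 7 (field names are assumptions)."""
    concept_dim: int = 256            # c
    contrastive_weight: float = 1.0   # λcon
    background_weight: float = 30.0   # λback
    gamma: float = 1e-5               # γ
    beta: float = 2.0                 # β
    n: int = 8                        # n
    num_steps: int = 5000             # K
    batch_size: int = 128
    learning_rate: float = 1e-4
    lr_decay: str = "cosine"
    lr_warmup_steps: int = 500
    optimizer: str = "adam"
    weight_decay: float = 0.2


cfg = CoCoInsConfig()
```

A dataclass like this keeps every value a reproduction attempt needs in one place and makes deviations from the paper explicit when a field is overridden.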