Progressive Compositionality in Text-to-Image Generative Models

Authors: Xu Han, Linghao Jin, Xiaofeng Liu, Paul Pu Liang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments across a wide range of compositional scenarios, we showcase the effectiveness of our proposed framework on compositional T2I benchmarks. The project page with data, code, and demos can be found at https://evansh666.github.io/EvoGen_Page/.
Researcher Affiliation | Academia | Yale University; University of Southern California; Massachusetts Institute of Technology
Pseudocode | No | The paper describes methods and equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The project page with data, code, and demos can be found at https://evansh666.github.io/EvoGen_Page/.
Open Datasets | Yes | We introduce CONPAIR, a meticulously crafted compositional dataset consisting of high-quality contrastive images with minimal visual representation differences, covering a wide range of attribute categories... The project page with data, code, and demos can be found at https://evansh666.github.io/EvoGen_Page/.
Dataset Splits | Yes | We divide the dataset into three stages and introduce a simple but effective multi-stage fine-tuning paradigm... The dataset is organized into three stages, each progressively increasing in complexity. Stage-I covers simpler tasks: Shape (500 samples), Color (800), Counting (800), Texture (800), Non-spatial relationships (800), and Scene (800), totaling 4,500 samples. Stage-II introduces more complex compositions, with 1,000 samples in each of Shape, Color, Counting, Texture, Spatial relationships, Non-spatial relationships, and Scene, for a reported total of 7,500 samples. Stage-III represents the most complex scenarios, with fewer but more intricate samples: simpler categories like those in Stages I and II each contain 200 samples, while the Complex category contains 2,000, totaling 3,400 samples. Across all stages, the dataset contains 15,400 samples, providing a wide range of compositional tasks for model training and evaluation.
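The stage-level arithmetic quoted above can be checked with a short sketch. Per-category counts are taken from the response; the dictionary layout is illustrative, and Stage-II is kept as its reported total since the excerpt's per-category breakdown is ambiguous:

```python
# Per-category counts for Stage-I and Stage-III as quoted in the response;
# Stage-II is kept as its reported total of 7,500 samples.
stage1 = {"Shape": 500, "Color": 800, "Counting": 800,
          "Texture": 800, "Non-spatial": 800, "Scene": 800}
stage3 = {"Shape": 200, "Color": 200, "Counting": 200, "Texture": 200,
          "Spatial": 200, "Non-spatial": 200, "Scene": 200, "Complex": 2000}
stage2_total = 7500

totals = {"Stage-I": sum(stage1.values()),
          "Stage-II": stage2_total,
          "Stage-III": sum(stage3.values())}
print(totals)                # {'Stage-I': 4500, 'Stage-II': 7500, 'Stage-III': 3400}
print(sum(totals.values()))  # 15400
```

The per-category sums reproduce the stage totals (4,500 and 3,400) and the overall count of 15,400 stated in the response.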
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions implementing their approach on "Stable Diffusion v2.1 and Stable Diffusion v3-medium" and employing "the pre-trained text encoder from the CLIP ViT-L/14 model." However, it does not provide version numbers for general software dependencies such as the programming language (e.g., Python), frameworks (e.g., PyTorch), or other libraries and solvers.
Experiment Setup | Yes | In an attempt to evaluate the faithfulness of generated images, we use GPT-4 to decompose a text prompt into a pair of questions and answers, which serve as the input of our VQA model, LLaVA v1.5 (Liu et al., 2024a). Following previous work (Huang et al., 2023; Feng et al., 2023a), we evaluate EVOGEN on Stable Diffusion v2 (Rombach et al., 2022)... The resolution is 768, the batch size is 16, and the learning rate is 3e-5 with linear decay.
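The quoted hyperparameters can be written down as a minimal sketch. The `config` dict, the `max_steps` value, and the `linear_decay_lr` helper are hypothetical conveniences for illustration; the excerpt states only the resolution, batch size, and initial learning rate with linear decay, not the authors' training code:

```python
# Hyperparameters quoted in the response; max_steps is a hypothetical
# placeholder, since the excerpt does not state a training step count.
config = {"resolution": 768, "batch_size": 16, "base_lr": 3e-5, "max_steps": 10_000}

def linear_decay_lr(step: int, max_steps: int, base_lr: float = 3e-5) -> float:
    """Linearly decay the learning rate from base_lr at step 0 to 0 at max_steps."""
    return base_lr * max(0.0, 1.0 - step / max_steps)

print(linear_decay_lr(0, config["max_steps"]))      # 3e-05
print(linear_decay_lr(5_000, config["max_steps"]))  # 1.5e-05
```

In practice the same schedule is usually obtained from a framework scheduler (e.g., a linear LR scheduler in PyTorch) rather than a hand-rolled function.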