Progressive Compositionality in Text-to-Image Generative Models
Authors: Xu Han, Linghao Jin, Xiaofeng Liu, Paul Pu Liang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments across a wide range of compositional scenarios, we showcase the effectiveness of our proposed framework on compositional T2I benchmarks. The project page with data, code, and demos can be found at https://evansh666.github.io/EvoGen_Page/. |
| Researcher Affiliation | Academia | 1Yale University, 2University of Southern California, 3Massachusetts Institute of Technology |
| Pseudocode | No | The paper describes methods and equations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The project page with data, code, and demos can be found at https://evansh666.github.io/EvoGen_Page/. |
| Open Datasets | Yes | We introduce CONPAIR, a meticulously crafted compositional dataset consisting of high-quality contrastive images with minimal visual representation differences, covering a wide range of attribute categories... The project page with data, code, and demos can be found at https://evansh666.github.io/EvoGen_Page/. |
| Dataset Splits | Yes | We divide the dataset into three stages and introduce a simple but effective multi-stage fine-tuning paradigm... The dataset is organized into three stages, each progressively increasing in complexity. In Stage-I, the dataset includes simpler tasks such as Shape (500 samples), Color (800), Counting (800), Texture (800), Non-spatial relationships (800), and Scene (800), totaling 4,500 samples. Stage-II introduces more complex compositions, with each category, including Shape, Color, Counting, Texture, Spatial relationships, Non-spatial relationships, and Scene, containing 1,000 samples, for a total of 7,500 samples. Stage-III represents the most complex scenarios, with fewer but more intricate samples. We also include some simpler cases, as in Stages I and II, each containing 200 samples, while the Complex category includes 2,000 samples, totaling 3,400 samples. Across all stages, the dataset contains 15,400 samples, providing a wide range of compositional tasks for model training and evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions implementing their approach on "Stable Diffusion v2.1 and Stable Diffusion v3-medium" and employing "the pre-trained text encoder from the CLIP ViT-L/14 model." However, it does not provide specific version numbers for general software dependencies like programming languages (e.g., Python), frameworks (e.g., PyTorch), or other libraries/solvers. |
| Experiment Setup | Yes | In an attempt to evaluate the faithfulness of generated images, we use GPT-4 to decompose a text prompt into a pair of questions and answers, which serve as the input of our VQA model, LLaVA v1.5 (Liu et al., 2024a). Following previous work (Huang et al., 2023; Feng et al., 2023a), we evaluate EVOGEN on Stable Diffusion v2 (Rombach et al., 2022)... The resolution is 768, the batch size is 16, and the learning rate is 3e-5 with linear decay. |
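The reported setup (three-stage CONPAIR curriculum; resolution 768, batch size 16, learning rate 3e-5 with linear decay) can be sketched as a minimal config. This is an illustrative sketch, not the authors' code: the total step count is an assumption, since the paper does not report it, and only the stage totals and hyperparameters above come from the source.

```python
from dataclasses import dataclass

# Per-stage sample totals reported for the CONPAIR curriculum
CONPAIR_STAGE_SIZES = {"Stage-I": 4_500, "Stage-II": 7_500, "Stage-III": 3_400}
assert sum(CONPAIR_STAGE_SIZES.values()) == 15_400  # grand total stated in the paper

@dataclass
class FinetuneConfig:
    # Hyperparameters reported in the experiment setup
    resolution: int = 768
    batch_size: int = 16
    base_lr: float = 3e-5
    # Hypothetical value: total optimization steps are not reported
    total_steps: int = 10_000

def linear_decay_lr(cfg: FinetuneConfig, step: int) -> float:
    """Learning rate decayed linearly from base_lr to zero over total_steps."""
    return cfg.base_lr * max(0.0, 1.0 - step / cfg.total_steps)

cfg = FinetuneConfig()
print(linear_decay_lr(cfg, 0))      # 3e-05
print(linear_decay_lr(cfg, 5_000))  # 1.5e-05
```

Each stage would be fine-tuned in sequence (Stage-I through Stage-III), with the schedule above applied per run; the paper does not specify whether the learning rate resets between stages.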