Progressive Compositionality in Text-to-Image Generative Models

Authors: Xu Han, Linghao Jin, Xiaofeng Liu, Paul Pu Liang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments across a wide range of compositional scenarios, we showcase the effectiveness of our proposed framework on compositional T2I benchmarks. The project page with data, code, and demos can be found at https://evansh666.github.io/EvoGen_Page/.
Researcher Affiliation | Academia | Yale University; University of Southern California; Massachusetts Institute of Technology
Pseudocode | No | The paper describes methods and equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The project page with data, code, and demos can be found at https://evansh666.github.io/EvoGen_Page/.
Open Datasets | Yes | We introduce CONPAIR, a meticulously crafted compositional dataset consisting of high-quality contrastive images with minimal visual representation differences, covering a wide range of attribute categories... The project page with data, code, and demos can be found at https://evansh666.github.io/EvoGen_Page/.
Dataset Splits | Yes | We divide the dataset into three stages and introduce a simple but effective multi-stage fine-tuning paradigm... The dataset is organized into three stages, each progressively increasing in complexity. Stage-I covers simpler tasks: Shape (500 samples), Color (800), Counting (800), Texture (800), Non-spatial relationships (800), and Scene (800), totaling 4,500 samples. Stage-II introduces more complex compositions, with 1,000 samples in each of Shape, Color, Counting, Texture, Spatial relationships, Non-spatial relationships, and Scene, for a reported total of 7,500 samples. Stage-III represents the most complex scenarios, with fewer but more intricate samples: simpler categories like those in Stages I and II each contain 200 samples, while the Complex category contains 2,000, totaling 3,400 samples. Across all stages, the dataset contains 15,400 samples, providing a wide range of compositional tasks for model training and evaluation.
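The stage-level arithmetic quoted above can be checked with a short sketch. Per-category counts are taken from the response; the dictionary layout is illustrative, and Stage-II is kept as its reported total since the excerpt's per-category breakdown is ambiguous:

```python
# Per-category counts for Stage-I and Stage-III as quoted in the response;
# Stage-II is kept as its reported total of 7,500 samples.
stage1 = {"Shape": 500, "Color": 800, "Counting": 800,
          "Texture": 800, "Non-spatial": 800, "Scene": 800}
stage3 = {"Shape": 200, "Color": 200, "Counting": 200, "Texture": 200,
          "Spatial": 200, "Non-spatial": 200, "Scene": 200, "Complex": 2000}
stage2_total = 7500

totals = {"Stage-I": sum(stage1.values()),
          "Stage-II": stage2_total,
          "Stage-III": sum(stage3.values())}
print(totals)                # {'Stage-I': 4500, 'Stage-II': 7500, 'Stage-III': 3400}
print(sum(totals.values()))  # 15400
```

The per-category sums reproduce the stage totals (4,500 and 3,400) and the overall count of 15,400 stated in the response.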
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions implementing their approach on "Stable Diffusion v2.1 and Stable Diffusion v3-medium" and employing "the pre-trained text encoder from the CLIP ViT-L/14 model." However, it does not provide version numbers for general software dependencies such as the programming language (e.g., Python), frameworks (e.g., PyTorch), or other libraries and solvers.
Experiment Setup | Yes | In an attempt to evaluate the faithfulness of generated images, we use GPT-4 to decompose a text prompt into a pair of questions and answers, which serve as the input of our VQA model, LLaVA v1.5 (Liu et al., 2024a). Following previous work (Huang et al., 2023; Feng et al., 2023a), we evaluate EVOGEN on Stable Diffusion v2 (Rombach et al., 2022)... The resolution is 768, the batch size is 16, and the learning rate is 3e-5 with linear decay.
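The quoted hyperparameters can be written down as a minimal sketch. The `config` dict, the `max_steps` value, and the `linear_decay_lr` helper are hypothetical conveniences for illustration; the excerpt states only the resolution, batch size, and initial learning rate with linear decay, not the authors' training code:

```python
# Hyperparameters quoted in the response; max_steps is a hypothetical
# placeholder, since the excerpt does not state a training step count.
config = {"resolution": 768, "batch_size": 16, "base_lr": 3e-5, "max_steps": 10_000}

def linear_decay_lr(step: int, max_steps: int, base_lr: float = 3e-5) -> float:
    """Linearly decay the learning rate from base_lr at step 0 to 0 at max_steps."""
    return base_lr * max(0.0, 1.0 - step / max_steps)

print(linear_decay_lr(0, config["max_steps"]))      # 3e-05
print(linear_decay_lr(5_000, config["max_steps"]))  # 1.5e-05
```

In practice the same schedule is usually obtained from a framework scheduler (e.g., a linear LR scheduler in PyTorch) rather than a hand-rolled function.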