Learning Vision and Language Concepts for Controllable Image Generation

Authors: Shaoan Xie, Lingjing Kong, Yujia Zheng, Zeyu Tang, Eric Xing, Guangyi Chen, Kun Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirical evaluations show that our model outperforms existing methods in T2I generation tasks, offering superior controllability and interpretability. In this section, we first describe the experimental setup, including implementation details, datasets, baselines, and evaluation metrics. We then present the results, covering comparisons with baselines, visualizations of learned concepts, disentanglement analysis, and ablation studies."
Researcher Affiliation | Academia | 1. Carnegie Mellon University; 2. Mohamed bin Zayed University of Artificial Intelligence.
Pseudocode | No | The paper describes the model design and loss functions in text and with diagrams, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper states: "We implement our method based on SANA (Xie et al., 2024)." This indicates use of another open-source project but does not explicitly state that the authors' own code for the method described in this paper is publicly released.
Open Datasets | Yes | "We use FLUX-S (Labs, 2024) to generate 2 million images using prompts sourced from the LAION dataset. Subsequently, we employ QWEN2-VL (Wang et al., 2024b) to produce accurate textual descriptions. To evaluate the controllability of our generative model, we need to generate pairs of images that reflect specific target changes. For this purpose, we utilize the EMU-Edit dataset (Sheynin et al., 2024)."
Dataset Splits | No | The paper mentions using the LAION dataset to generate training images and the EMU-Edit dataset for evaluation, including 3,589 paired prompts, but it does not specify explicit training/validation/test splits, percentages, or sample counts for these datasets.
Hardware Specification | Yes | The paper reports a "Sampling efficiency comparison on a H100 GPU."
Software Dependencies | No | The paper states: "We implement our method based on SANA (Xie et al., 2024)" and mentions using "Siglip (Zhai et al., 2023) image embedding" and LoRA, but it does not provide version numbers for these software components or for any programming languages or libraries.
Experiment Setup | Yes | "We define the number of textual tokens as 64. Then we use a transformer block to transform the Siglip (Zhai et al., 2023) image embedding into the mean and variance of the latent ε. Then we feed the re-parametrized latent into a 6-block perceiver resampler with masking m_{z_T}. Finally, we obtain the image representation z_I and replace the original text embedding with this representation. We use LoRA on the diffusion transformer with rank 256. All the parameters are trained with batch size 768 and learning rate 5×10^-5."
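The quoted setup can be sketched as a minimal, illustrative pipeline: an embedding is mapped to a mean and log-variance, sampled via the reparameterization trick, masked per token, and passed through a stack of six blocks. Everything here is a placeholder sketch — the dimensions, random weights, mask pattern, and plain linear blocks (standing in for the SigLIP encoder and the perceiver resampler) are assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 64 textual tokens (per the quoted setup), toy width 32.
n_tokens, d = 64, 32

# Hypothetical stand-in for the SigLIP image embedding.
siglip_emb = rng.normal(size=(n_tokens, d))

# A linear stand-in for the transformer block that predicts the
# mean and log-variance of the latent from the image embedding.
W_mu = rng.normal(size=(d, d)) * 0.1
W_logvar = rng.normal(size=(d, d)) * 0.1
mu = siglip_emb @ W_mu
logvar = siglip_emb @ W_logvar

# Reparameterization trick: z = mu + sigma * noise, so the sample stays
# differentiable with respect to mu and logvar during training.
noise = rng.normal(size=mu.shape)
z = mu + np.exp(0.5 * logvar) * noise

# Token-level masking applied before the resampler, mirroring the
# "masking m_{z_T}" step (the mask pattern here is arbitrary).
m = rng.random(n_tokens) > 0.5
z_masked = z * m[:, None]

# A 6-block "resampler" stand-in: plain nonlinear maps in place of the
# paper's perceiver resampler blocks.
z_I = z_masked
for _ in range(6):
    W = rng.normal(size=(d, d)) * 0.1
    z_I = np.tanh(z_I @ W)

# z_I is the image representation that would replace the text embedding.
print(z_I.shape)  # → (64, 32)
```

Note that masked-out tokens stay zero through every block (a zero row maps to a zero row under a linear map followed by tanh), which is one simple way a token mask can persist through such a stack.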