Learning Vision and Language Concepts for Controllable Image Generation

Authors: Shaoan Xie, Lingjing Kong, Yujia Zheng, Zeyu Tang, Eric Xing, Guangyi Chen, Kun Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirical evaluations show that our model outperforms existing methods in T2I generation tasks, offering superior controllability and interpretability. In this section, we first describe the experimental setup, including implementation details, datasets, baselines, and evaluation metrics. We then present the results, covering comparisons with baselines, visualizations of learned concepts, disentanglement analysis, and ablation studies."
Researcher Affiliation | Academia | 1. Carnegie Mellon University; 2. Mohamed bin Zayed University of Artificial Intelligence.
Pseudocode | No | The paper describes the model design and loss functions in text and with diagrams, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper states: "We implement our method based on SANA (Xie et al., 2024)." This indicates use of another open-source project but does not explicitly state that the authors' own code for the method described in this paper is publicly released.
Open Datasets | Yes | "We use FLUX-S (Labs, 2024) to generate 2 million images using prompts sourced from the LAION dataset. Subsequently, we employ QWEN2-VL (Wang et al., 2024b) to produce accurate textual descriptions. To evaluate the controllability of our generative model, we need to generate pairs of images that reflect specific target changes. For this purpose, we utilize the EMU-Edit dataset (Sheynin et al., 2024)."
Dataset Splits | No | The paper mentions using the LAION dataset to generate training images and the EMU-Edit dataset for evaluation, including 3,589 paired prompts, but it does not specify explicit training/validation/test splits, percentages, or sample counts for these datasets.
Hardware Specification | Yes | The paper reports a "Sampling efficiency comparison on a H100 GPU."
Software Dependencies | No | The paper states: "We implement our method based on SANA (Xie et al., 2024)" and mentions using "Siglip (Zhai et al., 2023) image embedding" and LoRA, but it does not provide version numbers for these software components or for any programming languages or libraries.
Experiment Setup | Yes | "We define the number of textual tokens as 64. Then we use a transformer block to transform the Siglip (Zhai et al., 2023) image embedding into the mean and variance of the latent ε. Then we feed the re-parametrized latent into a 6-block perceiver resampler with masking m_{z_T}. Finally, we obtain the image representation z_I and replace the original text embedding with this representation. We use LoRA on the diffusion transformer with rank 256. All the parameters are trained with batch size 768 and learning rate 5×10^-5."
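The quoted setup can be sketched as a minimal, illustrative pipeline: an embedding is mapped to a mean and log-variance, sampled via the reparameterization trick, masked per token, and passed through a stack of six blocks. Everything here is a placeholder sketch — the dimensions, random weights, mask pattern, and plain linear blocks (standing in for the SigLIP encoder and the perceiver resampler) are assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 64 textual tokens (per the quoted setup), toy width 32.
n_tokens, d = 64, 32

# Hypothetical stand-in for the SigLIP image embedding.
siglip_emb = rng.normal(size=(n_tokens, d))

# A linear stand-in for the transformer block that predicts the
# mean and log-variance of the latent from the image embedding.
W_mu = rng.normal(size=(d, d)) * 0.1
W_logvar = rng.normal(size=(d, d)) * 0.1
mu = siglip_emb @ W_mu
logvar = siglip_emb @ W_logvar

# Reparameterization trick: z = mu + sigma * noise, so the sample stays
# differentiable with respect to mu and logvar during training.
noise = rng.normal(size=mu.shape)
z = mu + np.exp(0.5 * logvar) * noise

# Token-level masking applied before the resampler, mirroring the
# "masking m_{z_T}" step (the mask pattern here is arbitrary).
m = rng.random(n_tokens) > 0.5
z_masked = z * m[:, None]

# A 6-block "resampler" stand-in: plain nonlinear maps in place of the
# paper's perceiver resampler blocks.
z_I = z_masked
for _ in range(6):
    W = rng.normal(size=(d, d)) * 0.1
    z_I = np.tanh(z_I @ W)

# z_I is the image representation that would replace the text embedding.
print(z_I.shape)  # → (64, 32)
```

Note that masked-out tokens stay zero through every block (a zero row maps to a zero row under a linear map followed by tanh), which is one simple way a token mask can persist through such a stack.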