Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective

Authors: Xiangru Zhu, Penglei Sun, Yaoxian Song, Yanghua Xiao, Zhixu Li, Chengyu Wang, Jun Huang, Bei Yang, Xiaoxiao Xu

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments reveal that CogView-3-Plus and Ideogram 2 performed best, achieving a score of 0.2/1. Semantic variations in object relations are less well understood than variations in attributes, scoring 0.07/1 compared to 0.17-0.19/1. We found that cross-modal alignment in the UNet or Transformer plays a crucial role in handling semantic variations, a factor previously overlooked by the focus on text encoders. Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
Researcher Affiliation Collaboration (1) Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University; (2) Hong Kong University of Science and Technology (Guangzhou); (3) Zhejiang University; (4) School of Information, Renmin University of China; School of Smart Governance, Renmin University of China; (5) Alibaba Group
Pseudocode No No pseudocode or algorithm blocks are explicitly labeled in the paper. The methodology is described through mathematical definitions and textual explanations.
Open Source Code Yes Our benchmark and code are available at https://github.com/zhuxiangru/SemVarBench.
Open Datasets Yes Our benchmark and code are available at https://github.com/zhuxiangru/SemVarBench. We create a semantic variation dataset for T2I synthesis through two types of linguistic permutations. We choose all 171 sentence pairs suitable for T2I synthesis from Winoground (Thrush et al., 2022; Diwan et al., 2022) as seed pairs.
Dataset Splits Yes SemVarBench comprises 11,454 samples of (Ta, Tpv, Tpi), divided into a training set and a test set. The training set contains 10,806 samples, while the test set consists of 648 samples.
Hardware Specification Yes Our computational resources included an NVIDIA GeForce RTX 4090 with 25.2 GB of VRAM and a 16-core AMD EPYC 9354 processor, with 60.1 GB of system memory available. We also used a Tesla V100-SXM2 with 32 GB of VRAM and an 11-core Intel(R) Xeon(R) Platinum 8163 processor, with 88.0 GB of system memory available.
Software Dependencies No The paper mentions using the "diffusers library" for fine-tuning but does not provide specific version numbers for this or any other key software components.
Experiment Setup Yes In our experiments, we utilized Stable Diffusion XL v1.0 to generate an image for each text prompt within the training set. To select the training data, we designated C2 = 0.8 and C3 = 0.1. Ultimately, we selected 327 samples, resulting in 981 sentences. We fine-tuned only the LoRA adapters, on either the UNet or the text encoder, for 5,000 steps with a training batch size of 1. We trained the LoRA model with a rank of 4 on the UNet or text encoders, and the training process took approximately 0.5 hours.
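The fine-tuning setup described above could plausibly be reproduced with the LoRA example script shipped in the diffusers repository. The sketch below maps the reported hyperparameters (SDXL v1.0, LoRA rank 4, batch size 1, 5,000 steps) onto that script's flags; the script path, dataset directory, and output directory are assumptions, not details given in the paper.

```shell
# Hypothetical invocation of the diffusers SDXL LoRA example script,
# using the hyperparameters reported in the paper. The data and
# output paths are placeholders.
accelerate launch examples/text_to_image/train_text_to_image_lora_sdxl.py \
  --pretrained_model_name_or_path stabilityai/stable-diffusion-xl-base-1.0 \
  --train_data_dir ./semvarbench_train \
  --rank 4 \
  --train_batch_size 1 \
  --max_train_steps 5000 \
  --output_dir ./lora-sdxl-semvar
```

Training the text-encoder variant instead of the UNet would use the script's text-encoder training option; which component is adapted is the experimental variable the paper compares.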