Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective
Authors: Xiangru Zhu, Penglei Sun, Yaoxian Song, Yanghua Xiao, Zhixu Li, Chengyu Wang, Jun Huang, Bei Yang, Xiaoxiao Xu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments reveal that CogView-3-Plus and Ideogram 2 performed the best, achieving a score of 0.2/1. Semantic variations in object relations are less understood than attributes, scoring 0.07/1 compared to 0.17-0.19/1. We found that cross-modal alignment in UNet or Transformers plays a crucial role in handling semantic variations, a factor previously overlooked by a focus on textual encoders. Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding. |
| Researcher Affiliation | Collaboration | (1) Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University; (2) Hong Kong University of Science and Technology (Guangzhou); (3) Zhejiang University; (4) School of Information, Renmin University of China and School of Smart Governance, Renmin University of China; (5) Alibaba Group |
| Pseudocode | No | No pseudocode or algorithm blocks are explicitly labeled in the paper. The methodology is described through mathematical definitions and textual explanations. |
| Open Source Code | Yes | Our benchmark and code are available at https://github.com/zhuxiangru/SemVarBench. |
| Open Datasets | Yes | Our benchmark and code are available at https://github.com/zhuxiangru/SemVarBench. We create a semantic variation dataset for T2I synthesis through two types of linguistic permutations. We choose all 171 sentence pairs suitable for T2I synthesis from Winoground (Thrush et al., 2022; Diwan et al., 2022) as seed pairs. |
| Dataset Splits | Yes | SemVarBench comprises 11,454 samples of (T_a, T_pv, T_pi), divided into a training set and a test set. The training set contains 10,806 samples, while the test set consists of 648 samples. |
| Hardware Specification | Yes | Our computational resources included an NVIDIA GeForce RTX 4090 with 25.2 GB of VRAM and a 16-core AMD EPYC 9354 processor, with 60.1 GB of system memory available. We also used a Tesla V100-SXM2 with 32 GB of VRAM and an 11-core Intel(R) Xeon(R) Platinum 8163 processor, with 88.0 GB of system memory available. |
| Software Dependencies | No | The paper mentions using the "diffusers library" for fine-tuning but does not provide specific version numbers for this or any other key software components. |
| Experiment Setup | Yes | In our experiments, we utilized Stable Diffusion XL v1.0 to generate an image for each text prompt within the training set. To select the training data, we designated C2 = 0.8 and C3 = 0.1. Ultimately, we selected 327 samples, resulting in 981 sentences. We fine-tuned the LoRA model on either the UNet or the text encoder for 5,000 steps, with a training batch size of 1. We trained the LoRA model with a rank of 4 on the UNet or text encoders, and the training process took approximately 0.5 hours. |
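The rank-4 LoRA fine-tuning described in the Experiment Setup row can be illustrated numerically. The sketch below shows the generic LoRA update W' = W + (alpha / r) · B · A with rank r = 4; it is not the paper's implementation (which uses the diffusers library on SDXL), and the matrix dimensions, scaling factor `alpha`, and initialization scheme are illustrative assumptions.

```python
import numpy as np

def lora_delta(B, A, alpha, r):
    """Low-rank update added to a frozen weight: (alpha / r) * B @ A."""
    return (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
out_dim, in_dim, r, alpha = 64, 64, 4, 4  # r = 4 matches the paper's LoRA rank

W = rng.standard_normal((out_dim, in_dim))   # frozen base weight (not trained)
A = rng.standard_normal((r, in_dim)) * 0.01  # trainable down-projection
B = np.zeros((out_dim, r))                   # trainable up-projection, zero-init

W_adapted = W + lora_delta(B, A, alpha, r)

# With B initialized to zero, the adapted weight equals the base weight,
# so fine-tuning starts exactly from the pretrained model's behavior.
assert np.allclose(W_adapted, W)

# Parameter savings of a rank-4 adapter versus the full weight matrix:
full_params = out_dim * in_dim          # 64 * 64 = 4096
lora_params = r * (in_dim + out_dim)    # 4 * (64 + 64) = 512
print(full_params, lora_params)
```

The zero initialization of `B` is why LoRA can be attached to either the UNet or the text encoder without disturbing the pretrained model before training begins; only `A` and `B` receive gradient updates.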