Enhancing Diversity in Text-to-Image Generation without Compromising Fidelity
Authors: Jiazhi Li, Mi Zhou, Mahyar Khayatkhoei, Jingyu Shi, Xiang Gao, Jiageng Zhu, Hanchen Xie, Xiyun Song, Zongfang Lin, Heather Yu, Jieyu Zhao
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to demonstrate the effectiveness of our method in enhancing demographic diversity (Intersectional Diversity (Shrestha et al., 2024)) by 2.47 and sample diversity (Recall (Kynkäänniemi et al., 2019)) by 1.45 while preserving sample fidelity (Precision (Kynkäänniemi et al., 2019)) compared to the baseline diffusion model. We quantitatively demonstrate that SOTA models and existing diversity-focused methods struggle to capture real-world sample variability in Tab. 1 and Fig. 4, and exhibit limited prompt reusability in Fig. 6. |
| Researcher Affiliation | Collaboration | Jiazhi Li (Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California); Mi Zhou (School of Electrical and Computer Engineering, Georgia Institute of Technology); Mahyar Khayatkhoei (Information Sciences Institute, University of Southern California); Jingyu Shi (Elmore Family School of Electrical and Computer Engineering, Purdue University); Xiang Gao (Department of Computer Science, Stony Brook University); Jiageng Zhu (Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California); Hanchen Xie (Thomas Lord Department of Computer Science, University of Southern California); Xiyun Song (Futurewei Technologies Inc.); Zongfang Lin (Futurewei Technologies Inc.); Heather Yu (Futurewei Technologies Inc.); Jieyu Zhao (Thomas Lord Department of Computer Science, University of Southern California) |
| Pseudocode | Yes | Algorithm 1 Training a Diffusion Model with Bimodal Classifier-Free Guidance... Algorithm 2 Inference with Bimodal Classifier-Free Guidance |
| Open Source Code | No | The paper does not contain an explicit statement such as "We release our code" nor does it provide a direct link to a code repository for the methodology described in this paper. |
| Open Datasets | Yes | Following (Shrestha et al., 2024), we construct non-overlapping training and reference image datasets from human images in MSCOCO (Lin et al., 2014) and Open Images v6 (Krasin et al., 2017), and store the pre-computed CLIP image embeddings (Radford et al., 2021) to speed up the inference process by bypassing the usage of the image encoder during inference... Additionally, using lower-quality retrieval databases (e.g., CelebA (Liu et al., 2018)) slightly lowers fidelity and quality but maintains competitive diversity |
| Dataset Splits | No | The paper mentions "non-overlapping training and reference image datasets" for MSCOCO and Open Images v6, but does not provide specific percentages, sample counts, or explicit details of how these datasets were split into training, validation, or testing sets for their experiments. |
| Hardware Specification | Yes | For inference time, the baseline (SDv2.1) and our method take 2.77 seconds and 3.86 seconds, respectively, to generate a single image with 20 denoising steps on a single NVIDIA H100 GPU. |
| Software Dependencies | No | The paper mentions using components like the CLIP image encoder, CLIP text encoder, VAE, and U-Net, and refers to ChatGPT-o1, but it does not specify version numbers for any of the software libraries or dependencies used in the implementation of their method. |
| Experiment Setup | Yes | We set ωp as 7.5 following (Ho and Salimans, 2022) and choose ωi < ωp to prioritize text modality in image generation... we randomly discard conditioning during training. Specifically, we replace the training prompt with an empty sequence with probability πp = 0.1, following (Saharia et al., 2022), and replace the image embedding with the all-zero embeddings of the same size with probability πi = 0.1... For evaluation, we use prompts of 80 occupations... and generate 10,000 images (i.e., 125 per prompt)... our method take 2.77 seconds and 3.86 seconds, respectively, to generate a single image with 20 denoising steps on a single NVIDIA H100 GPU. |
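The experiment-setup row above quotes the paper's bimodal classifier-free guidance recipe: conditioning dropout during training (prompt replaced by an empty sequence with probability πp = 0.1, image embedding replaced by zeros with probability πi = 0.1) and two guidance scales at inference (ωp = 7.5 for text, ωi < ωp for the image modality). The paper's exact combination rule is not quoted in this report, so the sketch below is a minimal NumPy illustration assuming the common additive CFG form; the function names, the value ωi = 3.0, and the use of `None` to stand in for the empty-sequence embedding are all hypothetical.

```python
import numpy as np


def bimodal_cfg(eps_uncond, eps_text, eps_image, w_p=7.5, w_i=3.0):
    """Combine two classifier-free guidance terms (hypothetical form).

    Assumes the standard additive CFG rule applied once per modality,
    with the text scale w_p dominating the image scale w_i, matching
    the report's statement that w_i < w_p = 7.5. The paper's actual
    combination may differ.
    """
    return (eps_uncond
            + w_p * (eps_text - eps_uncond)
            + w_i * (eps_image - eps_uncond))


def drop_conditioning(text_emb, img_emb, rng, pi_p=0.1, pi_i=0.1):
    """Training-time conditioning dropout as quoted in the report:
    replace the prompt with an empty sequence with probability pi_p,
    and the image embedding with an all-zero embedding of the same
    size with probability pi_i (independently)."""
    if rng.random() < pi_p:
        text_emb = None  # placeholder for the empty-sequence embedding
    if rng.random() < pi_i:
        img_emb = np.zeros_like(img_emb)
    return text_emb, img_emb
```

With both dropout probabilities forced to 1.0, `drop_conditioning` returns the fully unconditional pair, which is what the model sees for the `eps_uncond` branch at inference time.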