Enhancing Diversity in Text-to-Image Generation without Compromising Fidelity
Authors: Jiazhi Li, Mi Zhou, Mahyar Khayatkhoei, Jingyu Shi, Xiang Gao, Jiageng Zhu, Hanchen Xie, Xiyun Song, Zongfang Lin, Heather Yu, Jieyu Zhao
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to demonstrate the effectiveness of our method in enhancing demographic diversity (Intersectional Diversity (Shrestha et al., 2024)) by 2.47 and sample diversity (Recall (Kynkäänniemi et al., 2019)) by 1.45 while preserving sample fidelity (Precision (Kynkäänniemi et al., 2019)) compared to the baseline diffusion model. We quantitatively demonstrate that SOTA models and existing diversity-focused methods struggle to capture real-world sample variability in Tab. 1 and Fig. 4, and exhibit limited prompt reusability in Fig. 6. |
| Researcher Affiliation | Collaboration | Jiazhi Li (Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California); Mi Zhou (School of Electrical and Computer Engineering, Georgia Institute of Technology); Mahyar Khayatkhoei (Information Sciences Institute, University of Southern California); Jingyu Shi (Elmore Family School of Electrical and Computer Engineering, Purdue University); Xiang Gao (Department of Computer Science, Stony Brook University); Jiageng Zhu (Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California); Hanchen Xie (Thomas Lord Department of Computer Science, University of Southern California); Xiyun Song (Futurewei Technologies Inc.); Zongfang Lin (Futurewei Technologies Inc.); Heather Yu (Futurewei Technologies Inc.); Jieyu Zhao (Thomas Lord Department of Computer Science, University of Southern California) |
| Pseudocode | Yes | Algorithm 1 Training a Diffusion Model with Bimodal Classifier-Free Guidance... Algorithm 2 Inference with Bimodal Classifier-Free Guidance |
| Open Source Code | No | The paper does not contain an explicit statement such as "We release our code" nor does it provide a direct link to a code repository for the methodology described in this paper. |
| Open Datasets | Yes | Following (Shrestha et al., 2024), we construct non-overlapping training and reference image datasets from human images in MSCOCO (Lin et al., 2014) and Open Images v6 (Krasin et al., 2017), and store the pre-computed CLIP image embeddings (Radford et al., 2021) to speed up the inference process by bypassing the usage of the image encoder during inference... Additionally, using lower-quality retrieval databases (e.g., CelebA (Liu et al., 2018)) slightly lowers fidelity and quality but maintains competitive diversity |
| Dataset Splits | No | The paper mentions "non-overlapping training and reference image datasets" for MSCOCO and Open Images v6, but does not provide specific percentages, sample counts, or explicit details of how these datasets were split into training, validation, or testing sets for their experiments. |
| Hardware Specification | Yes | For inference time, the baseline (SDv2.1) and our method take 2.77 seconds and 3.86 seconds, respectively, to generate a single image with 20 denoising steps on a single NVIDIA H100 GPU. |
| Software Dependencies | No | The paper mentions using components like the CLIP image encoder, CLIP text encoder, VAE, and U-Net, and refers to ChatGPT-o1, but it does not specify version numbers for any of the software libraries or dependencies used in the implementation of their method. |
| Experiment Setup | Yes | We set ωp as 7.5 following (Ho and Salimans, 2022) and choose ωi < ωp to prioritize text modality in image generation... we randomly discard conditioning during training. Specifically, we replace the training prompt with an empty sequence with probability πp = 0.1, following (Saharia et al., 2022), and replace the image embedding with the all-zero embeddings of the same size with probability πi = 0.1... For evaluation, we use prompts of 80 occupations... and generate 10,000 images (i.e., 125 per prompt)... our method take 2.77 seconds and 3.86 seconds, respectively, to generate a single image with 20 denoising steps on a single NVIDIA H100 GPU. |
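The experiment-setup row above quotes the paper's bimodal classifier-free guidance recipe: conditioning dropout during training (prompt replaced by an empty sequence with probability πp = 0.1, image embedding replaced by zeros with probability πi = 0.1) and two guidance scales at inference (ωp = 7.5 for text, ωi < ωp for the image modality). The paper's exact combination rule is not quoted in this report, so the sketch below is a minimal NumPy illustration assuming the common additive CFG form; the function names, the value ωi = 3.0, and the use of `None` to stand in for the empty-sequence embedding are all hypothetical.

```python
import numpy as np


def bimodal_cfg(eps_uncond, eps_text, eps_image, w_p=7.5, w_i=3.0):
    """Combine two classifier-free guidance terms (hypothetical form).

    Assumes the standard additive CFG rule applied once per modality,
    with the text scale w_p dominating the image scale w_i, matching
    the report's statement that w_i < w_p = 7.5. The paper's actual
    combination may differ.
    """
    return (eps_uncond
            + w_p * (eps_text - eps_uncond)
            + w_i * (eps_image - eps_uncond))


def drop_conditioning(text_emb, img_emb, rng, pi_p=0.1, pi_i=0.1):
    """Training-time conditioning dropout as quoted in the report:
    replace the prompt with an empty sequence with probability pi_p,
    and the image embedding with an all-zero embedding of the same
    size with probability pi_i (independently)."""
    if rng.random() < pi_p:
        text_emb = None  # placeholder for the empty-sequence embedding
    if rng.random() < pi_i:
        img_emb = np.zeros_like(img_emb)
    return text_emb, img_emb
```

With both dropout probabilities forced to 1.0, `drop_conditioning` returns the fully unconditional pair, which is what the model sees for the `eps_uncond` branch at inference time.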