Reproducibility Study of "ITI-GEN: Inclusive Text-to-Image Generation"
Authors: Daniel Gallo Fernández, Răzvan-Andrei Matișan, Alejandro Monroy Muñoz, Janusz Partyka
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This study aims to reproduce the results presented in "ITI-GEN: Inclusive Text-to-Image Generation" by Zhang et al. (2023a), which introduces a method to improve inclusiveness in text-to-image generation. We show that most of the claims made by the authors about ITI-GEN hold: it improves the diversity and quality of generated images, it is scalable to different domains, it has plug-and-play capabilities, and it is computationally efficient. However, ITI-GEN sometimes uses undesired attributes as proxy features, and it is unable to disentangle some pairs of correlated attributes such as gender and baldness. In addition, when the number of considered attributes increases, the training time grows exponentially and ITI-GEN struggles to generate inclusive images for all elements in the joint distribution. To address these issues, we propose using Hard Prompt Search with negative prompting, a method that requires no training and handles negation better than vanilla Hard Prompt Search. |
| Researcher Affiliation | Academia | Daniel Gallo Fernández, University of Amsterdam; Răzvan-Andrei Matișan, University of Amsterdam; Alejandro Monroy Muñoz, University of Amsterdam; Janusz Partyka, University of Amsterdam |
| Pseudocode | No | The paper describes the ITI-GEN model's architecture and training procedure, including loss functions, but it does not present this information in structured pseudocode or an algorithm block. |
| Open Source Code | Yes | The code used to replicate and extend the experiments can be found in our GitHub repository (https://github.com/amonroym99/iti-gen-reproducibility). We made two minor changes to the authors' implementation: adding a seed in the training loop of ITI-GEN to make it reproducible, and fixing a bug in the image-generation script to handle batch sizes larger than 1. We also provide bash scripts to replicate all our experiments easily. |
| Open Datasets | Yes | The authors provide four datasets of reference images to train the model: CelebA (Liu et al., 2015), a manually labeled face dataset with 40 binary attributes; for each attribute, there are 200 positive and 200 negative samples (400 in total). FAIR (Feng et al., 2022), a synthetic face dataset classified into six skin-tone levels, with 703 images distributed almost equally among the six categories. FairFace (Karkkainen & Joo, 2021), a face dataset annotated for age (9 intervals), gender (male or female), and race (7 categories), with 200 images per category for every attribute. Landscapes HQ (LHQ) (Skorokhodov et al., 2021), a dataset of natural scene images annotated using the tool provided in Wang et al. (2023), with 11 attributes, each divided into five ordinal categories with 400 samples per category. |
| Dataset Splits | No | The paper describes the number of reference images used per category for training (e.g., "200 positive and negative samples" or "200 images per category") and the number of images generated for evaluation (e.g., "generate 104 images (13 batches of size 8) per category" and "5,040 images" for FID score). However, it does not specify explicit train/test/validation splits of the overall datasets used for model training and evaluation. |
| Hardware Specification | Yes | We perform all experiments on an NVIDIA A100 GPU. Training a single attribute for 30 epochs takes around a minute for CelebA, 3 minutes for LHQ, 4 minutes for FAIR, and less than 5 minutes for FairFace. For the four datasets, we use 200 images per category (or all of them if there are fewer than 200). Generating a batch of 8 images takes around 21 seconds (less than 3 seconds per image). It is also possible to run inference on an Apple M2 chip, although it takes more than 30 seconds per image. |
| Software Dependencies | No | Following the original paper, we use Stable Diffusion v1.4 (Rombach et al., 2022) for most of the experiments. We also show compatibility with models using additional conditions, such as ControlNet (Zhang et al., 2023b), in a plug-and-play manner. We use the default training hyperparameters from the code provided. Similarly, for image generation, we use the default hyperparameters from the ControlNet (Zhang et al., 2023b) and Stable Diffusion (Rombach et al., 2022) repositories (for HPS and HPSn). |
| Experiment Setup | Yes | To reproduce the results of the original paper as closely as possible, we use the default training hyperparameters from the code provided. Similarly, for image generation, we use the default hyperparameters from the ControlNet (Zhang et al., 2023b) and Stable Diffusion (Rombach et al., 2022) repositories (for HPS and HPSn). Moreover, we generate images with a batch size of 8, which is the largest power of two that fits in an NVIDIA A100 with 40 GB of VRAM. We set different seeds to generate images, which might explain why we get a better score (the authors do not report their generation method). Training a single attribute for 30 epochs takes around a minute for CelebA, 3 minutes for LHQ, 4 minutes for FAIR, and less than 5 minutes for FairFace. |
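The generation setup described above (104 images per category in 13 batches of 8, with explicit seeds so runs are reproducible) amounts to simple bookkeeping. The sketch below illustrates one way to lay out that plan; the function name, prompt template, and seed scheme are hypothetical illustrations and not the authors' code:

```python
def generation_plan(category, num_images=104, batch_size=8, base_seed=0):
    """Split `num_images` into seeded batches for one attribute category.

    Each batch gets a distinct seed (base_seed + batch index) so that
    image generation is deterministic across runs, mirroring the
    reproducibility fix the study applied to the authors' script.
    The prompt template here is a placeholder, not the paper's prompt.
    """
    assert num_images % batch_size == 0, "use a batch count that divides evenly"
    return [
        {
            "prompt": f"a headshot of a person, {category}",  # hypothetical template
            "batch_size": batch_size,
            "seed": base_seed + i,
        }
        for i in range(num_images // batch_size)
    ]

# 104 images per category -> 13 batches of 8, seeds 0..12
plan = generation_plan("young")
```

Each entry of `plan` would then drive one call to the diffusion pipeline, with the per-batch seed used to initialize the random generator.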