SelfEval: Leveraging the discriminative nature of generative models for evaluation
Authors: Sai Saketh Rambhatla, Ishan Misra
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate generative models on standard datasets created for multimodal text-image discriminative learning and assess fine-grained aspects of their performance: attribute binding, color recognition, counting, shape recognition, spatial understanding. ... We now use SelfEval to evaluate text-to-image diffusion models. In Section 4.1, we introduce our benchmark datasets and models, and present the SelfEval results in Section 4.2. |
| Researcher Affiliation | Industry | Sai Saketh Rambhatla (EMAIL), GenAI, Meta; Ishan Misra (EMAIL), GenAI, Meta |
| Pseudocode | No | The paper describes the method using mathematical equations and text, and illustrates it conceptually in Figure 1, but does not include a distinct pseudocode or algorithm block. |
| Open Source Code | No | The paper mentions that one of the models they used is "accessed via an API containing open-sourced model weights", but there is no explicit statement or link indicating that the authors have open-sourced the code for the SelfEval methodology itself. |
| Open Datasets | Yes | SelfEval repurposes standard multimodal image-text datasets such as Visual Genome, COCO and CLEVR to measure the model's text understanding capabilities. ... The six tasks are constructed using data from TIFA Hu et al. (2023), CLEVR Johnson et al. (2016) and ARO Yuksekgonul et al. (2023). ... We now use the challenging Winoground Thrush et al. (2022) benchmark to evaluate the vision-language reasoning abilities of diffusion models. |
| Dataset Splits | Yes | We adopt the splits proposed by Lewis et al. (2022) for our case. ... For each benchmark task, we randomly sample 1000 examples and evaluate the classification performance on them. We repeat this three times and report the mean accuracy. ... We randomly pick 250 text prompts from each benchmark task as conditioning for human evaluation and the images are generated using DDIM Song et al. (2021) sampling, with 100 denoising steps. ... The data, for human evaluation, is constructed by randomly picking 500 examples from all the tasks (100 examples from each task except text corruption). |
| Hardware Specification | No | The paper mentions characteristics of the models such as "outputs images of 512 × 512 resolution" or "has a total of 4.2B parameters", and describes training steps, but does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments or training the models. |
| Software Dependencies | No | The paper discusses various models, text encoders (CLIP, T5), and sampling methods (DDIM Song et al. (2021)), but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We use 10 trials (i.e. N = 10) and perform diffusion for 100 steps (i.e. T = 100) for all the models. ... We randomly pick 250 text prompts from each benchmark task as conditioning for human evaluation and the images are generated using DDIM Song et al. (2021) sampling, with 100 denoising steps. ... We randomly sample 1000 examples and evaluate the classification performance on them. We repeat this three times and report the mean accuracy. |
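The sampling protocol quoted in the Dataset Splits and Experiment Setup rows (draw 1000 examples per benchmark task, repeat three times, report the mean accuracy) can be sketched as below. This is a minimal illustration, not the authors' code: the example set and the `is_correct` scoring function are hypothetical placeholders standing in for a benchmark task and a model's classification decision.

```python
import random
import statistics

def evaluate_task(examples, is_correct, n_samples=1000, n_repeats=3, seed=0):
    """Estimate mean classification accuracy on a benchmark task by
    repeatedly drawing `n_samples` random examples and averaging the
    per-run accuracy over `n_repeats` runs."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_repeats):
        # Sample without replacement within a run, as in the quoted protocol.
        batch = rng.sample(examples, min(n_samples, len(examples)))
        correct = sum(1 for ex in batch if is_correct(ex))
        accuracies.append(correct / len(batch))
    return statistics.mean(accuracies)

# Hypothetical usage: examples are (input, label) pairs; the scoring
# function is a stub in place of a real model's prediction check.
examples = [(i, i % 2) for i in range(5000)]
mean_acc = evaluate_task(examples, is_correct=lambda ex: ex[1] == 0)
print(f"mean accuracy over 3 runs: {mean_acc:.3f}")
```

Reporting the mean over several random subsamples, rather than a single draw, reduces the variance introduced by evaluating on only 1000 of the available examples.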