SelfEval: Leveraging the discriminative nature of generative models for evaluation
Authors: Sai Saketh Rambhatla, Ishan Misra
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate generative models on standard datasets created for multimodal text-image discriminative learning and assess fine-grained aspects of their performance: attribute binding, color recognition, counting, shape recognition, spatial understanding. ... We now use SelfEval to evaluate text-to-image diffusion models. In Section 4.1, we introduce our benchmark datasets and models, and present the SelfEval results in Section 4.2. |
| Researcher Affiliation | Industry | Sai Saketh Rambhatla (EMAIL), GenAI, Meta; Ishan Misra (EMAIL), GenAI, Meta |
| Pseudocode | No | The paper describes the method using mathematical equations and text, and illustrates it conceptually in Figure 1, but does not include a distinct pseudocode or algorithm block. |
| Open Source Code | No | The paper mentions that one of the models they used is "accessed via an API containing open-sourced model weights", but there is no explicit statement or link indicating that the authors have open-sourced the code for the SelfEval methodology itself. |
| Open Datasets | Yes | SelfEval repurposes standard multimodal image-text datasets such as Visual Genome, COCO and CLEVR to measure the model's text understanding capabilities. ... The six tasks are constructed using data from TIFA Hu et al. (2023), CLEVR Johnson et al. (2016) and ARO Yuksekgonul et al. (2023). ... We now use the challenging Winoground Thrush et al. (2022) benchmark to evaluate the vision-language reasoning abilities of diffusion models. |
| Dataset Splits | Yes | We adopt the splits proposed by Lewis et al. (2022) for our case. ... For each benchmark task, we randomly sample 1000 examples and evaluate the classification performance on them. We repeat this three times and report the mean accuracy. ... We randomly pick 250 text prompts from each benchmark task as conditioning for human evaluation and the images are generated using DDIM Song et al. (2021) sampling, with 100 denoising steps. ... The data, for human evaluation, is constructed by randomly picking 500 examples from all the tasks (100 examples from each task except text corruption). |
| Hardware Specification | No | The paper mentions characteristics of the models such as "outputs images of 512 × 512 resolution" or "has a total of 4.2B parameters", and describes training steps, but does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments or training the models. |
| Software Dependencies | No | The paper discusses various models, text encoders (CLIP, T5), and sampling methods (DDIM Song et al. (2021)), but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We use 10 trials (i.e. N = 10) and perform diffusion for 100 steps (i.e. T = 100) for all the models. ... We randomly pick 250 text prompts from each benchmark task as conditioning for human evaluation and the images are generated using DDIM Song et al. (2021) sampling, with 100 denoising steps. ... We randomly sample 1000 examples and evaluate the classification performance on them. We repeat this three times and report the mean accuracy. |
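The sampling protocol quoted in the Dataset Splits and Experiment Setup rows (draw 1000 examples per benchmark task, repeat three times, report the mean accuracy) can be sketched as below. This is a minimal illustration, not the authors' code: the example set and the `is_correct` scoring function are hypothetical placeholders standing in for a benchmark task and a model's classification decision.

```python
import random
import statistics

def evaluate_task(examples, is_correct, n_samples=1000, n_repeats=3, seed=0):
    """Estimate mean classification accuracy on a benchmark task by
    repeatedly drawing `n_samples` random examples and averaging the
    per-run accuracy over `n_repeats` runs."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_repeats):
        # Sample without replacement within a run, as in the quoted protocol.
        batch = rng.sample(examples, min(n_samples, len(examples)))
        correct = sum(1 for ex in batch if is_correct(ex))
        accuracies.append(correct / len(batch))
    return statistics.mean(accuracies)

# Hypothetical usage: examples are (input, label) pairs; the scoring
# function is a stub in place of a real model's prediction check.
examples = [(i, i % 2) for i in range(5000)]
mean_acc = evaluate_task(examples, is_correct=lambda ex: ex[1] == 0)
print(f"mean accuracy over 3 runs: {mean_acc:.3f}")
```

Reporting the mean over several random subsamples, rather than a single draw, reduces the variance introduced by evaluating on only 1000 of the available examples.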