Revisiting text-to-image evaluation with Gecko: on metrics, prompts, and human rating
Authors: Olivia Wiles, Chuhan Zhang, Isabela Albuquerque, Ivana Kajić, Su Wang, Emanuele Bugliarello, Yasumasa Onoe, Pinelopi Papalampidi, Ira Ktena, Christopher Knutsen, Cyrus Rashtchian, Anant Nawalgaria, Jordi Pont-Tuset, Aida Nematzadeh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We address this by introducing an evaluation suite of >100K annotations across four human annotation templates that comprehensively evaluates model capabilities across a range of methods for gathering human annotations and comparing models. In particular, we propose (1) a carefully curated set of prompts, Gecko2K; (2) a statistically grounded method of comparing T2I models; and (3) a framework to systematically evaluate metrics under three evaluation tasks: model ordering, pair-wise instance scoring, and point-wise instance scoring. |
| Researcher Affiliation | Industry | Correspondence to: EMAIL; EMAIL. Google DeepMind, Google Research, Google Cloud. |
| Pseudocode | No | The paper describes methods verbally and refers to 'sample template' or 'few shot prompts' in appendices and listings (e.g., 'Listing 1', 'Listing 3', 'Listing 4') which are typically example inputs or snippets, not structured pseudocode or algorithm blocks. No explicit section or figure for 'Pseudocode' or 'Algorithm' is present in the main text. |
| Open Source Code | Yes | Github link: https://github.com/google-deepmind/gecko_benchmark_t2i |
| Open Datasets | Yes | Gecko: An evaluation suite for T2I alignment which includes a comprehensive set of 2K prompts and 4 human templates to evaluate 4 T2I models, yielding 100K human annotations (Table 1). We also include the number of skills and sub-skills in each dataset. Again, Gecko includes the largest number of sub-skills, allowing for a fine-grained evaluation of metrics and models. The Gecko2K benchmark is similar in spirit to TIFA and DSG1K in that it evaluates a set of skills. However, in addition to drawing from previous datasets, which may be biased or poorly representative of the challenges of a particular skill, we collate prompts across sub-skills for each skill to obtain a discriminative prompt set. Moreover, we gather human annotations across multiple templates and many prompts (see Table 1). Github link: https://github.com/google-deepmind/gecko_benchmark_t2i |
| Dataset Splits | Yes | We additionally remove instances where all Likert ratings are Unsure to get 531 and 725 reliable prompts for Gecko(R) and Gecko(S), respectively. We first validate that using reliable prompts increases IAA on the SxS template (which was not used in the selection process) and find that it increases the average Kω from 0.45 to 0.47 on Gecko(R), and 0.49 to 0.54 on Gecko(S) (see App. D.2 for details). In the next sections, we demonstrate how this subset of prompts increases agreement among templates, but at the expense of removing some potentially meaningful prompts. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. It mentions models like 'Gemini Flash', 'PaLM-2', and 'PaLI' but not the underlying hardware. |
| Software Dependencies | No | The paper mentions various models like 'CLIP', 'PyramidCLIP', 'Gemini Flash', 'PaLM-2', 'PaLI', and 'T5-11B' that were used, but it does not specify any software libraries or frameworks with their corresponding version numbers (e.g., Python 3.8, PyTorch 1.9, TensorFlow 2.x). |
| Experiment Setup | Yes | When evaluating the Gecko metric, apart from using the LLM and VQA models above, we utilise a T5-11B model from Honovich et al. (2022) for NLI filtering and set the threshold r at 0.005. This threshold was determined by examining QA pairs with NLI probability scores below 0.05. We observed that QAs with scores below 0.005 are typically hallucinations. |
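The NLI filtering step quoted above can be sketched as a simple threshold filter. This is a minimal illustration, not the paper's actual pipeline: the tuple format, function name, and example scores are hypothetical, and in the paper the entailment scores come from a T5-11B NLI model (Honovich et al., 2022) rather than being supplied directly.

```python
# Illustrative sketch of the QA filtering described in the Experiment Setup row:
# QA pairs whose NLI entailment probability falls below r = 0.005 were observed
# to be hallucinations, so they are discarded before scoring.

NLI_THRESHOLD = 0.005  # the threshold r from the paper

def filter_qa_pairs(qa_pairs, threshold=NLI_THRESHOLD):
    """Keep only QA pairs whose NLI score meets the threshold.

    qa_pairs: iterable of (question, answer, nli_score) tuples; nli_score is
    assumed to be an entailment probability from an NLI model such as T5-11B.
    """
    return [(q, a, s) for q, a, s in qa_pairs if s >= threshold]

# Hypothetical scores for illustration only.
example = [
    ("Is there a dog?", "yes", 0.91),
    ("Is the dog purple?", "yes", 0.001),   # likely hallucinated QA, dropped
    ("Is the sky cloudy?", "yes", 0.0049),  # just below r, also dropped
]
kept = filter_qa_pairs(example)
```

Here only the first pair survives; the other two fall below r and are treated as hallucinated questions.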