Revisiting text-to-image evaluation with Gecko: on metrics, prompts, and human rating
Authors: Olivia Wiles, Chuhan Zhang, Isabela Albuquerque, Ivana Kajić, Su Wang, Emanuele Bugliarello, Yasumasa Onoe, Pinelopi Papalampidi, Ira Ktena, Christopher Knutsen, Cyrus Rashtchian, Anant Nawalgaria, Jordi Pont-Tuset, Aida Nematzadeh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We address this by introducing an evaluation suite of >100K annotations across four human annotation templates that comprehensively evaluates model capabilities across a range of methods for gathering human annotations and comparing models. In particular, we propose (1) a carefully curated set of prompts, Gecko2K; (2) a statistically grounded method of comparing T2I models; and (3) a framework to systematically evaluate metrics under three evaluation tasks: model ordering, pair-wise instance scoring, and point-wise instance scoring. |
| Researcher Affiliation | Industry | Correspondence to: EMAIL; EMAIL. Google DeepMind, Google Research, Google Cloud. |
| Pseudocode | No | The paper describes methods verbally and refers to 'sample template' or 'few shot prompts' in appendices and listings (e.g., 'Listing 1', 'Listing 3', 'Listing 4') which are typically example inputs or snippets, not structured pseudocode or algorithm blocks. No explicit section or figure for 'Pseudocode' or 'Algorithm' is present in the main text. |
| Open Source Code | Yes | Github link: https://github.com/google-deepmind/gecko_benchmark_t2i |
| Open Datasets | Yes | Gecko: An evaluation suite for T2I alignment which includes a comprehensive set of 2K prompts and 4 human templates to evaluate 4 T2I models, yielding 100K human annotations (Table 1). We also include the number of skills and sub-skills in each dataset. Again, Gecko includes the largest number of sub-skills, allowing for a fine-grained evaluation of metrics and models. The Gecko2K benchmark is similar in spirit to TIFA and DSG1K in that it evaluates a set of skills. However, in addition to drawing from previous datasets, which may be biased or poorly representative of the challenges of a particular skill, we collate prompts across sub-skills for each skill to obtain a discriminative prompt set. Moreover, we gather human annotations across multiple templates and many prompts (see Table 1). Github link: https://github.com/google-deepmind/gecko_benchmark_t2i |
| Dataset Splits | Yes | We additionally remove instances where all Likert ratings are Unsure to get 531 and 725 reliable prompts for Gecko(R) and Gecko(S), respectively. We first validate that using reliable prompts increases IAA on the SxS template (which was not used in the selection process) and find that it increases the average Kω from 0.45 to 0.47 on Gecko(R), and 0.49 to 0.54 on Gecko(S) (see App. D.2 for details). In the next sections, we demonstrate how this subset of prompts increases agreement among templates, but at the expense of removing some potentially meaningful prompts. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. It mentions models like 'Gemini Flash', 'PaLM-2', and 'PaLI' but not the underlying hardware. |
| Software Dependencies | No | The paper mentions various models like 'CLIP', 'PyramidCLIP', 'Gemini Flash', 'PaLM-2', 'PaLI', and 'T5-11B' that were used, but it does not specify any software libraries or frameworks with their corresponding version numbers (e.g., Python 3.8, PyTorch 1.9, TensorFlow 2.x). |
| Experiment Setup | Yes | When evaluating the Gecko metric, apart from using the LLM and VQA models above, we utilise a T5-11B model from Honovich et al. (2022) for NLI filtering and set the threshold r at 0.005. This threshold was determined by examining QA pairs with NLI probability scores below 0.05. We observed that QAs with scores below 0.005 are typically hallucinations. |
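The NLI filtering step quoted above can be sketched as a simple threshold filter. This is a minimal illustration, not the paper's actual pipeline: the tuple format, function name, and example scores are hypothetical, and in the paper the entailment scores come from a T5-11B NLI model (Honovich et al., 2022) rather than being supplied directly.

```python
# Illustrative sketch of the QA filtering described in the Experiment Setup row:
# QA pairs whose NLI entailment probability falls below r = 0.005 were observed
# to be hallucinations, so they are discarded before scoring.

NLI_THRESHOLD = 0.005  # the threshold r from the paper

def filter_qa_pairs(qa_pairs, threshold=NLI_THRESHOLD):
    """Keep only QA pairs whose NLI score meets the threshold.

    qa_pairs: iterable of (question, answer, nli_score) tuples; nli_score is
    assumed to be an entailment probability from an NLI model such as T5-11B.
    """
    return [(q, a, s) for q, a, s in qa_pairs if s >= threshold]

# Hypothetical scores for illustration only.
example = [
    ("Is there a dog?", "yes", 0.91),
    ("Is the dog purple?", "yes", 0.001),   # likely hallucinated QA, dropped
    ("Is the sky cloudy?", "yes", 0.0049),  # just below r, also dropped
]
kept = filter_qa_pairs(example)
```

Here only the first pair survives; the other two fall below r and are treated as hallucinated questions.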