A-Bench: Are LMMs Masters at Evaluating AI-generated Images?

Authors: Zicheng Zhang, Haoning Wu, Chunyi Li, Yingjie Zhou, Wei Sun, Xiongkuo Min, Zijian Chen, Xiaohong Liu, Weisi Lin, Guangtao Zhai

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Therefore, we introduce A-Bench in this paper, a benchmark designed to diagnose whether LMMs are masters at evaluating AIGIs. Specifically, A-Bench is organized under two key principles: ... Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with question-answers annotated by human experts. We then test 23 prominent LMMs, including both open-source and closed-source models, on the A-Bench.
Researcher Affiliation | Academia | Zicheng Zhang1, Haoning Wu2, Chunyi Li1, Yingjie Zhou1, Wei Sun1, Xiongkuo Min1, Zijian Chen1, Xiaohong Liu1, Weisi Lin2, Guangtao Zhai1; 1Shanghai Jiao Tong University, 2Nanyang Technological University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks describing the methodology. It includes a 'GPT Evaluation Prompt Template' in the appendix, which is a text prompt template, not an algorithm.
Open Source Code | Yes | Project Page: https://github.com/Q-Future/A-Bench.
Open Datasets | Yes | Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with question-answers annotated by human experts. ... Specifically, a comprehensive dataset of 2,864 AIGIs sourced from various T2I models is compiled... AIGIQA-20K dataset (Li et al., 2024)... q-align (Wu et al., 2023c)... The A-Bench dataset is released under the CC BY 4.0 license.
Dataset Splits | No | The paper describes the internal structure of the A-Bench dataset, specifying 1,408 AIGIs for A-Bench P1 and 1,456 for A-Bench P2, and further sub-categories within these. It also mentions sampling strategies for these parts. However, it does not provide traditional train/validation/test splits, as the models being evaluated are pre-trained and tested in a zero-shot setting.
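The two subset sizes should sum to the full pool of sampled images. A one-line sanity check (variable names are illustrative, not from the paper) confirms the reported counts are consistent:

```python
# A-Bench subset sizes as reported in the paper.
p1_aigis = 1408  # A-Bench P1
p2_aigis = 1456  # A-Bench P2

total = p1_aigis + p2_aigis
print(total)  # 2864, matching the 2,864 AIGIs sampled from 16 T2I models
```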
Hardware Specification | Yes | Proprietary LMMs are evaluated via official APIs, whereas the open-source LMMs (with the exceptions of LLaVA-NeXT Qwen-72B and LLaVA-NeXT Qwen-110B) run on an NVIDIA RTX 6000 Ada with 48 GB of memory. The LLaVA-NeXT Qwen-72B and LLaVA-NeXT Qwen-110B are operated on 4 NVIDIA H100 with 320 GB of memory.
Software Dependencies | No | The paper does not list specific version numbers for ancillary software components such as programming languages or libraries (e.g., Python, PyTorch, TensorFlow). It mentions using various LMMs (e.g., GPT-4o, Gemini 1.5 Pro) and a 'GPT-assisted choice evaluation technique', but without specifying the versions of the underlying software used for development or evaluation by the authors themselves.
Experiment Setup | Yes | All LMMs are tested with zero-shot setting. ... All LMMs operate with default parameters... we set the model's temperature parameter to 0, meaning the LMM's output will no longer be affected by randomness.
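A minimal sketch of what this setup implies for a single query, assuming a generic request dict (the field names and helper below are illustrative, not the authors' harness): each multiple-choice question is posed zero-shot, with the sampling temperature pinned to 0 for deterministic output.

```python
def build_query(question: str, choices: list[str], image_path: str) -> dict:
    """Assemble one zero-shot multiple-choice request for an LMM.
    Letters A, B, C, ... are prefixed to the candidate answers."""
    prompt = question + "\n" + "\n".join(
        f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(choices)
    )
    return {
        "image": image_path,
        "prompt": prompt,
        "temperature": 0,   # removes sampling randomness, per the paper's setup
        "zero_shot": True,  # no in-context examples are provided
    }

query = build_query(
    "What is the main object in the image?",
    ["a cat", "a dog", "a car"],
    "aigi_0001.png",
)
print(query["prompt"])
```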