A-Bench: Are LMMs Masters at Evaluating AI-generated Images?

Authors: Zicheng Zhang, Haoning Wu, Chunyi Li, Yingjie Zhou, Wei Sun, Xiongkuo Min, Zijian Chen, Xiaohong Liu, Weisi Lin, Guangtao Zhai

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Therefore, we introduce A-Bench in this paper, a benchmark designed to diagnose whether LMMs are masters at evaluating AIGIs. Specifically, A-Bench is organized under two key principles: ... Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with question-answers annotated by human experts. We then test 23 prominent LMMs, including both open-source and closed-source models, on the A-Bench.
Researcher Affiliation | Academia | Zicheng Zhang1, Haoning Wu2, Chunyi Li1, Yingjie Zhou1, Wei Sun1, Xiongkuo Min1, Zijian Chen1, Xiaohong Liu1, Weisi Lin2, Guangtao Zhai1; 1Shanghai Jiao Tong University, 2Nanyang Technological University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks describing the methodology. It includes a 'GPT Evaluation Prompt Template' in the appendix, which is a text prompt template, not an algorithm.
Open Source Code | Yes | Project Page: https://github.com/Q-Future/A-Bench.
Open Datasets | Yes | Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with question-answers annotated by human experts. ... Specifically, a comprehensive dataset of 2,864 AIGIs sourced from various T2I models is compiled... AIGIQA-20K dataset (Li et al., 2024)... q-align (Wu et al., 2023c)... The A-Bench dataset is released under the CC BY 4.0 license.
Dataset Splits | No | The paper describes the internal structure of the A-Bench dataset, specifying 1,408 AIGIs for A-Bench P1 and 1,456 for A-Bench P2, and further sub-categories within these. It also mentions sampling strategies for these parts. However, it does not provide traditional train/validation/test splits, as the models being evaluated are pre-trained and tested in a zero-shot setting.
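The two subset sizes should sum to the full pool of sampled images. A one-line sanity check (variable names are illustrative, not from the paper) confirms the reported counts are consistent:

```python
# A-Bench subset sizes as reported in the paper.
p1_aigis = 1408  # A-Bench P1
p2_aigis = 1456  # A-Bench P2

total = p1_aigis + p2_aigis
print(total)  # 2864, matching the 2,864 AIGIs sampled from 16 T2I models
```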
Hardware Specification | Yes | Proprietary LMMs are evaluated via official APIs, whereas the open-source LMMs (with the exceptions of LLaVA-NeXT Qwen-72B and LLaVA-NeXT Qwen-110B) run on an NVIDIA RTX 6000 Ada with 48 GB of memory. The LLaVA-NeXT Qwen-72B and LLaVA-NeXT Qwen-110B are operated on 4 NVIDIA H100 with 320 GB of memory.
Software Dependencies | No | The paper does not list specific version numbers for ancillary software components such as programming languages or libraries (e.g., Python, PyTorch, TensorFlow). It mentions using various LMMs (e.g., GPT-4o, Gemini 1.5 Pro) and a 'GPT-assisted choice evaluation technique', but without specifying the versions of the underlying software used for development or evaluation by the authors themselves.
Experiment Setup | Yes | All LMMs are tested with zero-shot setting. ... All LMMs operate with default parameters... we set the model's temperature parameter to 0, meaning the LMM's output will no longer be affected by randomness.
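A minimal sketch of what this setup implies for a single query, assuming a generic request dict (the field names and helper below are illustrative, not the authors' harness): each multiple-choice question is posed zero-shot, with the sampling temperature pinned to 0 for deterministic output.

```python
def build_query(question: str, choices: list[str], image_path: str) -> dict:
    """Assemble one zero-shot multiple-choice request for an LMM.
    Letters A, B, C, ... are prefixed to the candidate answers."""
    prompt = question + "\n" + "\n".join(
        f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(choices)
    )
    return {
        "image": image_path,
        "prompt": prompt,
        "temperature": 0,   # removes sampling randomness, per the paper's setup
        "zero_shot": True,  # no in-context examples are provided
    }

query = build_query(
    "What is the main object in the image?",
    ["a cat", "a dog", "a car"],
    "aigi_0001.png",
)
print(query["prompt"])
```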