A-Bench: Are LMMs Masters at Evaluating AI-generated Images?
Authors: Zicheng Zhang, Haoning Wu, Chunyi Li, Yingjie Zhou, Wei Sun, Xiongkuo Min, Zijian Chen, Xiaohong Liu, Weisi Lin, Guangtao Zhai
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Therefore, we introduce A-Bench in this paper, a benchmark designed to diagnose whether LMMs are masters at evaluating AIGIs. Specifically, A-Bench is organized under two key principles: ... Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with question-answers annotated by human experts. We then test 23 prominent LMMs, including both open-source and closed-source models, on the A-Bench. |
| Researcher Affiliation | Academia | Zicheng Zhang1, Haoning Wu2, Chunyi Li1, Yingjie Zhou1, Wei Sun1, Xiongkuo Min1, Zijian Chen1, Xiaohong Liu1, Weisi Lin2, Guangtao Zhai1, 1Shanghai Jiao Tong University, 2Nanyang Technological University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks describing the methodology. It includes a 'GPT Evaluation Prompt Template' in the appendix, which is a text prompt template, not an algorithm. |
| Open Source Code | Yes | Project Page: https://github.com/Q-Future/A-Bench. |
| Open Datasets | Yes | Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with question-answers annotated by human experts. ... Specifically, a comprehensive dataset of 2,864 AIGIs sourced from various T2I models is compiled... AIGIQA-20K dataset (Li et al., 2024)... q-align (Wu et al., 2023c)... The A-Bench dataset is released under the CC BY 4.0 license. |
| Dataset Splits | No | The paper describes the internal structure of the A-Bench dataset, specifying 1,408 AIGIs for A-Bench P1 and 1,456 for A-Bench P2, and further sub-categories within these. It also mentions sampling strategies for these parts. However, it does not provide traditional train/validation/test splits as the models being evaluated are pre-trained and tested in a zero-shot setting. |
| Hardware Specification | Yes | Proprietary LMMs are evaluated via official APIs, whereas the open-source LMMs (with the exceptions of LLaVA-NeXT Qwen-72B and LLaVA-NeXT Qwen-110B) run on an NVIDIA RTX 6000 Ada with 48 GB of memory. The LLaVA-NeXT Qwen-72B and LLaVA-NeXT Qwen-110B are operated on 4 NVIDIA H100 with 320 GB of memory. |
| Software Dependencies | No | The paper does not list specific version numbers for ancillary software components such as programming languages or libraries (e.g., Python, PyTorch, TensorFlow). It mentions using various LMMs (e.g., GPT-4o, Gemini 1.5 Pro) and a 'GPT-assisted choice evaluation technique', but without specifying the versions of the underlying software used for development or evaluation by the authors themselves. |
| Experiment Setup | Yes | All LMMs are tested in a zero-shot setting. ... All LMMs operate with default parameters... we set the model's temperature parameter to 0, meaning the LMM's output will no longer be affected by randomness. |
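The zero-shot, temperature-0 evaluation protocol described above can be sketched as a request builder. This is a hypothetical illustration (the function name, prompt wording, and parameter dict are assumptions, not the authors' code); it only shows how the deterministic multiple-choice query would be assembled before being sent to an LMM API.

```python
def build_query(question: str, choices: list[str]) -> dict:
    """Assemble a deterministic (temperature=0) zero-shot
    multiple-choice request for an LMM. Hypothetical sketch."""
    # Label the options A, B, C, ... as in a typical MCQ benchmark.
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    prompt = (
        f"{question}\n{options}\n"
        "Answer with the letter of the correct option only."
    )
    return {
        "prompt": prompt,
        "temperature": 0,  # removes sampling randomness, per the paper
        "max_tokens": 8,   # a single option letter suffices
    }

query = build_query(
    "Which object is missing from the generated image?",
    ["a red apple", "a blue car", "a wooden chair"],
)
print(query["temperature"])  # 0
```

With temperature fixed at 0, repeated queries on the same image-question pair return the same answer, which is what makes the reported accuracies reproducible across runs.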