Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering
Authors: Youngsun Lim, Hojun Choi, Hyunjung Shim
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation protocols measure image hallucination by testing whether images from existing TTI models can correctly answer these questions. We evaluate five TTI models using I-HallA and reveal that these state-of-the-art models often fail to accurately convey factual information. Moreover, we validate the reliability of our metric by demonstrating a strong Spearman correlation (ρ=0.95) with human judgments. ... Experimental results show a strong correlation between our metric and human evaluation, with Spearman's ρ=0.95, indicating close alignment in assessing hallucination. |
| Researcher Affiliation | Academia | Kim Jaechul Graduate School of AI, KAIST |
| Pseudocode | No | The paper describes a three-stage pipeline for constructing the benchmark and evaluation metric, but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide any links to a code repository. |
| Open Datasets | No | The paper introduces a new benchmark dataset called I-HallA v1.0. While it is described as a 'curated benchmark dataset for this purpose' and intended to 'serve as a foundation,' the paper does not provide a specific URL, DOI, repository name, or any other concrete access information for the dataset. |
| Dataset Splits | No | The paper describes the composition of its benchmark dataset, I-HallA v1.0, including categories such as science and history, but it does not provide explicit training, validation, or test splits for this benchmark or for any other dataset in a way that would support reproducible model training. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running experiments, such as specific GPU or CPU models, or details about the computing environment. |
| Software Dependencies | No | The paper mentions several models and systems like GPT-4o, DALL-E 3, and various Stable Diffusion versions, along with other evaluation metrics and VLMs. However, it does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, CUDA versions) that would be needed to replicate the experimental environment. |
| Experiment Setup | No | The paper details the methodology for creating the I-HallA benchmark and its evaluation metric, including how questions and answers are generated and how scores are calculated. However, it does not provide specific experimental setup details such as hyperparameter values, training configurations, or optimizer settings for any models (either the TTI models being evaluated or the VQA model used in I-HallA). |
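The reliability claim in the table rests on a Spearman rank correlation between the metric's scores and human judgments. A minimal sketch of that check is below; the helper name `spearman` and the toy scores are illustrative assumptions, not the paper's data or implementation.

```python
def _ranks(values):
    """Assign 1-based ranks, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the run of tied values and give each the mean rank.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical per-image metric scores vs. human ratings.
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
human_ratings = [5, 1, 4, 2, 3]
print(spearman(metric_scores, human_ratings))  # -> 0.9
```

With real data, a high ρ (the paper reports 0.95) indicates the metric orders images by hallucination severity much as human raters do.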