Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering
Authors: Youngsun Lim, Hojun Choi, Hyunjung Shim
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation protocols measure image hallucination by testing whether images from existing TTI models can correctly answer these questions. We evaluate five TTI models using I-HallA and reveal that these state-of-the-art models often fail to accurately convey factual information. Moreover, we validate the reliability of our metric by demonstrating a strong Spearman correlation (ρ=0.95) with human judgments. ... Experimental results show a strong correlation between our metric and human evaluation, with Spearman's ρ=0.95, indicating close alignment in assessing hallucination. |
| Researcher Affiliation | Academia | Kim Jaechul Graduate School of AI, KAIST |
| Pseudocode | No | The paper describes a three-stage pipeline for constructing the benchmark and evaluation metric, but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide any links to a code repository. |
| Open Datasets | No | The paper introduces a new benchmark dataset called I-HallA v1.0. While it is described as a 'curated benchmark dataset for this purpose' and intended to 'serve as a foundation,' the paper does not provide a specific URL, DOI, repository name, or any other concrete access information for the dataset. |
| Dataset Splits | No | The paper describes the composition of its benchmark dataset, I-HallA v1.0, including categories such as science and history, but it does not provide explicit training, validation, or test splits for this benchmark or for any other dataset in a way that would support reproducible model training. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running experiments, such as specific GPU or CPU models, or details about the computing environment. |
| Software Dependencies | No | The paper mentions several models and systems like GPT-4o, DALL-E 3, and various Stable Diffusion versions, along with other evaluation metrics and VLMs. However, it does not list specific software dependencies with their version numbers (e.g., Python, PyTorch, CUDA versions) that would be needed to replicate the experimental environment. |
| Experiment Setup | No | The paper details the methodology for creating the I-HallA benchmark and its evaluation metric, including how questions and answers are generated and how scores are calculated. However, it does not provide specific experimental setup details such as hyperparameter values, training configurations, or optimizer settings for any models (either the TTI models being evaluated or the VQA model used in I-HallA). |
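The reliability claim in the table rests on a Spearman rank correlation between the metric's scores and human judgments. A minimal sketch of that check is below; the helper name `spearman` and the toy scores are illustrative assumptions, not the paper's data or implementation.

```python
def _ranks(values):
    """Assign 1-based ranks, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the run of tied values and give each the mean rank.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical per-image metric scores vs. human ratings.
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
human_ratings = [5, 1, 4, 2, 3]
print(spearman(metric_scores, human_ratings))  # -> 0.9
```

With real data, a high ρ (the paper reports 0.95) indicates the metric orders images by hallucination severity much as human raters do.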