Are High-Quality AI-Generated Images More Difficult for Models to Detect?

Authors: Yao Xiao, Binbin Yang, Weiyan Chen, Jiahao Chen, Zijie Cao, Ziyi Dong, Xiangyang Ji, Liang Lin, Wei Ke, Pengxu Wei

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | However, our systematic study on cutting-edge text-to-image generators reveals a counterintuitive finding: AIGIs with higher quality scores, as assessed by human preference models, tend to be more easily detected by existing models. To investigate this, we examine how the text prompts for generation and image characteristics influence both quality scores and detector accuracy. Furthermore, through clustering and regression analyses, we verify that image characteristics like saturation, contrast, and texture richness collectively impact both image quality and detector accuracy. Finally, we demonstrate that the performance of off-the-shelf detectors can be enhanced across diverse generators and datasets by selecting input patches based on the predicted scores of our regression models, thus substantiating the broader applicability of our findings.
Researcher Affiliation | Academia | 1) School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; 2) School of Software Engineering, Xi'an Jiaotong University, Xi'an, China; 3) Department of Automation, Tsinghua University, Beijing, China; 4) Peng Cheng Laboratory, Shenzhen, China.
Pseudocode | No | The paper describes methodologies and analyses, but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and data are available at GitHub.
Open Datasets | Yes | To this end, we construct a high-quality and diverse dataset by 1) collecting real images from four source datasets; 2) obtaining 4,000 captions spanning a wide range of complexity from these real images; and 3) generating fake images using these captions as prompts based on text-to-image generators, e.g., Stable Diffusion 2.1 (SD 2.1) (Rombach et al., 2022), Stable Diffusion XL 1.0 (SDXL 1.0) (Podell et al., 2024), Stable Diffusion 3 (SD 3) (Esser et al., 2024), and PixArt-α (Chen et al., 2024c). ... we collect real images from four existing datasets: COCO (Lin et al., 2014), CC3M (Sharma et al., 2018), LAION-Aesthetic (Schuhmann et al., 2022), and SA-1B (Kirillov et al., 2023).
Dataset Splits | No | The paper describes dataset collection and evaluation of detectors on various generators and datasets. While it mentions training the SSP model on GenImage, it does not provide specific training/validation/test splits for the authors' own collected dataset or for the regression models used in their analysis.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions various models and tools used (e.g., BLIP-2, DINOv2, K-Means algorithm, Canny edge detector), but it does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | No | The paper describes its evaluation setup and mentions using official pre-trained weights and configurations for existing AIGI detectors. It also discusses linear regression analyses. However, it does not provide specific hyperparameters (e.g., learning rate, batch size, epochs) or detailed training configurations for any new models or analyses presented in the main text (e.g., the regression models).
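The patch-selection idea summarized in the table (regressing detector accuracy on image characteristics such as saturation, contrast, and texture richness, then keeping the highest-scoring patches for an off-the-shelf detector) can be sketched as below. This is a minimal illustration, not the paper's implementation: the feature proxies (HSV-style saturation, grayscale standard deviation for contrast, gradient magnitude in place of Canny edge density for texture) and the plain least-squares model are assumptions chosen to keep the sketch self-contained.

```python
import numpy as np

def patch_features(patch):
    """Illustrative per-patch characteristics (not the paper's exact definitions):
    saturation, contrast, and a texture-richness proxy."""
    rgb = patch.astype(np.float64) / 255.0
    mx, mn = rgb.max(axis=-1), rgb.min(axis=-1)
    # HSV-style saturation, averaged over pixels
    saturation = np.mean(np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-8), 0.0))
    gray = rgb.mean(axis=-1)
    contrast = gray.std()
    gy, gx = np.gradient(gray)
    # Mean gradient magnitude as a stand-in for Canny edge density
    texture = np.mean(np.hypot(gx, gy))
    return np.array([saturation, contrast, texture])

def fit_score_model(patches, detector_acc):
    """Least-squares linear regression from patch characteristics to a target
    score (e.g., detector accuracy on those patches)."""
    X = np.stack([patch_features(p) for p in patches])
    X = np.hstack([X, np.ones((len(X), 1))])  # bias column
    w, *_ = np.linalg.lstsq(X, np.asarray(detector_acc, dtype=np.float64), rcond=None)
    return w

def select_patches(patches, w, k=4):
    """Rank patches by predicted score and keep the top-k as detector input."""
    X = np.stack([patch_features(p) for p in patches])
    X = np.hstack([X, np.ones((len(X), 1))])
    scores = X @ w
    order = np.argsort(scores)[::-1]
    return [patches[i] for i in order[:k]]
```

In use, one would fit `w` on patches with known detector outcomes, then call `select_patches` on patches cropped from a new image before running the detector on the survivors.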