Human and AI Perceptual Differences in Image Classification Errors
Authors: Minghao Liu, Jiaheng Wei, Yang Liu, James Davis
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This study first analyzes the statistical distributions of mistakes from the two sources and then explores how task difficulty level affects these distributions. We find that even when AI learns an excellent model from the training data, one that outperforms humans in overall accuracy, these AI models have significant and consistent differences from human perception. We demonstrate the importance of studying these differences with a simple human-AI teaming algorithm that outperforms humans alone, AI alone, or AI-AI teaming. |
| Researcher Affiliation | Academia | Minghao Liu¹, Jiaheng Wei², Yang Liu¹, James Davis¹. ¹ Department of Computer Science and Engineering, University of California, Santa Cruz; ² Data Science and Analytics Thrust, Hong Kong University of Science and Technology (Guangzhou). |
| Pseudocode | No | The paper describes methods and algorithms but does not present any structured pseudocode or algorithm blocks. The mathematical formulations are presented in standard prose or equations. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code, nor does it provide a link to a code repository or supplementary materials for code. |
| Open Datasets | Yes | The top-ranked machine vision models can achieve extremely high accuracy on CIFAR-10 (Krizhevsky, Hinton et al. 2009) image classification by training on clean labels. ... In this paper, we adopt CIFAR-N (Wei et al. 2022c), a label-noise benchmark that provides three noisy human annotations for each image of the CIFAR-10 training dataset. |
| Dataset Splits | Yes | We split the training set into a 40K training subset and a 10K test subset. We explore human perceptual differences using these noisy human annotations. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running the experiments. It only discusses software, datasets, and general experimental results without specifying the underlying hardware. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software libraries or frameworks used in the experiments. It mentions neural networks and machine learning classifiers in general but no concrete software dependencies with versions. |
| Experiment Setup | No | The paper discusses various machine learning models and their accuracy but does not provide specific hyperparameters like learning rates, batch sizes, number of epochs, or optimizer settings. It describes the general methodology and evaluations but lacks the detailed configuration necessary for reproduction. |
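The 40K/10K split of the CIFAR-10 training set reported under "Dataset Splits" can be sketched as a simple index partition. This is a minimal illustration only: the paper does not specify its splitting procedure, and the function name and seed below are assumptions.

```python
import random

def split_train_subset(n_total=50_000, n_train=40_000, seed=0):
    """Partition CIFAR-10 training indices into a 40K training subset
    and a 10K held-out test subset (sizes from the paper; the shuffling
    procedure and seed are illustrative assumptions)."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:]

train_idx, test_idx = split_train_subset()
print(len(train_idx), len(test_idx))  # 40000 10000
```

Any deterministic, disjoint partition of the 50K training indices would match the split sizes the paper reports.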