reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Interpretable Failure Detection with Human-Level Concepts

Authors: Kien X. Nguyen, Tang Li, Xi Peng

AAAI 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We rigorously validate our method's efficacy in detecting incorrect samples across both natural and remote sensing image benchmarks... We evaluate ORCA on a wide variety of datasets... We report the performance of all methods on the three evaluation metrics on the natural image benchmarks... Ablation Studies
Researcher Affiliation	Academia	Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
Pseudocode	No	The paper describes the methods using mathematical equations and textual descriptions, but there is no explicitly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code	Yes	Code https://github.com/Nyquixt/ORCA
Open Datasets	Yes	Datasets. We evaluate ORCA on a wide variety of datasets: 1. Natural Image Benchmark (1) CIFAR-10/100 (Krizhevsky 2009)... (2) ImageNet1K (Deng et al. 2009)... 2. Satellite Image Benchmark (3) EuroSAT (Helber et al. 2017)... (4) RESISC45 (Cheng, Han, and Lu 2017)
Dataset Splits	Yes	ImageNet1K (Deng et al. 2009), a well-known benchmark in computer vision, containing 1000 fine-grained categories, with 1,281,167 training and 50,000 validation samples.
Hardware Specification	No	The paper mentions using "CLIP's ResNet-101 and ViT-B/32 backbones" but does not specify the hardware (e.g., GPU/CPU models, memory) on which these models were run for experiments.
Software Dependencies	No	The paper mentions using "CLIP (Radford et al. 2021)" and "GPT-3.5 (Brown et al. 2020; Peng et al. 2023)" but does not specify version numbers for these or any other software libraries or frameworks used in the implementation.
Experiment Setup	Yes	We use the default temperature T = 1000 and do not use perturbation for fair comparison... We study the effect of the number of concepts on the performance on AUROC and FPR@95TPR of DescCLIP + MSP, ODIN, DOCTOR and ORCA-R... For dataset with few categories... we use different prompts to retrieve diverse collections of concepts from the large language model GPT-3.5... and manually select the top 10 visual concepts... An example of our prompt is as follows... For datasets with a larger number of categories... we then select the top concepts that yield the highest average similarity score with the images within each category to form A.
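
The Experiment Setup excerpt mentions selecting, for datasets with many categories, the concepts with the highest average similarity to the images of each category. The sketch below illustrates one way that step could look. It is not taken from the ORCA repository. The function name `select_top_concepts`, the use of cosine similarity, and the assumption that image and concept embeddings are already available as NumPy arrays (for example, from a CLIP encoder) are all hypothetical choices made for illustration.

```python
import numpy as np


def select_top_concepts(image_embs, concept_embs, k):
    """Return indices of the k concepts with the highest mean cosine
    similarity to the images of a single category.

    image_embs:   array of shape (n_images, d), one row per image (assumed input)
    concept_embs: array of shape (n_concepts, d), one row per candidate concept
    k:            number of concepts to keep
    """
    image_embs = np.asarray(image_embs, dtype=float)
    concept_embs = np.asarray(concept_embs, dtype=float)

    # L2-normalize rows so that the dot product equals cosine similarity.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    con = concept_embs / np.linalg.norm(concept_embs, axis=1, keepdims=True)

    # Similarity of every image to every concept: shape (n_images, n_concepts).
    sims = img @ con.T

    # Average over the images in the category: shape (n_concepts,).
    mean_sims = sims.mean(axis=0)

    # Sort concepts by descending mean similarity and keep the first k.
    order = np.argsort(-mean_sims, kind="stable")
    return order[:k].tolist()
```

Running this once per category and collecting the selected concepts would give a per-category concept set in the spirit of the collection the excerpt calls A. The paper's exact similarity measure and selection details may differ.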