Interpretable Failure Detection with Human-Level Concepts

Authors: Kien X. Nguyen, Tang Li, Xi Peng

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We rigorously validate our method's efficacy in detecting incorrect samples across both natural and remote sensing image benchmarks... We evaluate ORCA on a wide variety of datasets... We report the performance of all methods on the three evaluation metrics on the natural image benchmarks... Ablation Studies
Researcher Affiliation | Academia | Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
Pseudocode | No | The paper describes the methods using mathematical equations and textual descriptions, but there is no explicitly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Code: https://github.com/Nyquixt/ORCA
Open Datasets | Yes | Datasets. We evaluate ORCA on a wide variety of datasets: 1. Natural Image Benchmark: (1) CIFAR-10/100 (Krizhevsky 2009)... (2) ImageNet1K (Deng et al. 2009)... 2. Satellite Image Benchmark: (3) EuroSAT (Helber et al. 2017)... (4) RESISC45 (Cheng, Han, and Lu 2017)
Dataset Splits | Yes | ImageNet1K (Deng et al. 2009), a well-known benchmark in computer vision, containing 1000 fine-grained categories, with 1,281,167 training and 50,000 validation samples.
Hardware Specification | No | The paper mentions using "CLIP's ResNet-101 and ViT-B/32 backbones" but does not specify the hardware (e.g., GPU/CPU models, memory) on which the experiments were run.
Software Dependencies | No | The paper mentions using "CLIP (Radford et al. 2021)" and "GPT-3.5 (Brown et al. 2020; Peng et al. 2023)" but does not specify version numbers for these or any other software libraries or frameworks used in the implementation.
Experiment Setup | Yes | We use the default temperature T = 1000 and do not use perturbation for fair comparison... We study the effect of the number of concepts on the performance on AUROC and FPR@95TPR of DescCLIP + MSP, ODIN, DOCTOR and ORCA-R... For datasets with few categories... we use different prompts to retrieve diverse collections of concepts from the large language model GPT-3.5... and manually select the top 10 visual concepts... An example of our prompt is as follows... For datasets with a larger number of categories... we then select the top concepts that yield the highest average similarity score with the images within each category to form A. (A sketch of this concept-selection step is given below the table.)
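
The Experiment Setup evidence describes, for datasets with many categories, keeping the candidate concepts whose average similarity to a category's images is highest. The following is a minimal sketch of that selection step, not the authors' implementation: it assumes the OpenAI `clip` package and the ViT-B/32 backbone named in the paper, and the function name `select_top_concepts` and the placeholder inputs are ours.

```python
# Hedged sketch: rank GPT-3.5-derived concept strings for one category by their
# mean CLIP image-text similarity, and keep the top-k. Placeholder inputs; not
# the code released at https://github.com/Nyquixt/ORCA.

import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # backbone named in the paper

def select_top_concepts(candidate_concepts, category_images, k=10):
    """Rank candidate concept strings by mean CLIP similarity to a category's images.

    candidate_concepts: list[str], e.g. concepts retrieved from GPT-3.5
    category_images:    tensor of preprocessed images, shape (N, 3, 224, 224)
    """
    with torch.no_grad():
        text_tokens = clip.tokenize(candidate_concepts).to(device)
        text_feats = model.encode_text(text_tokens).float()
        image_feats = model.encode_image(category_images.to(device)).float()

    # Cosine similarity = dot product of L2-normalized embeddings.
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)

    # (num_concepts, num_images) similarity matrix, averaged over the images.
    mean_sim = (text_feats @ image_feats.T).mean(dim=1)

    top = mean_sim.topk(k=min(k, len(candidate_concepts))).indices.tolist()
    return [candidate_concepts[i] for i in top]
```

Running this once per category would produce the per-category concept set that the quoted passage calls A; for datasets with few categories, the paper instead reports selecting the top 10 concepts manually.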