Interpretable Failure Detection with Human-Level Concepts
Authors: Kien X. Nguyen, Tang Li, Xi Peng
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We rigorously validate our method's efficacy in detecting incorrect samples across both natural and remote sensing image benchmarks... We evaluate ORCA on a wide variety of datasets... We report the performance of all methods on the three evaluation metrics on the natural image benchmarks... Ablation Studies |
| Researcher Affiliation | Academia | Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA |
| Pseudocode | No | The paper describes the methods using mathematical equations and textual descriptions, but there is no explicitly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Code: https://github.com/Nyquixt/ORCA |
| Open Datasets | Yes | Datasets. We evaluate ORCA on a wide variety of datasets: 1. Natural Image Benchmark: (1) CIFAR-10/100 (Krizhevsky 2009)... (2) ImageNet1K (Deng et al. 2009)... 2. Satellite Image Benchmark: (3) EuroSAT (Helber et al. 2017)... (4) RESISC45 (Cheng, Han, and Lu 2017) |
| Dataset Splits | Yes | ImageNet1K (Deng et al. 2009), a well-known benchmark in computer vision, containing 1000 fine-grained categories, with 1,281,167 training and 50,000 validation samples. |
| Hardware Specification | No | The paper mentions using "CLIP's ResNet-101 and ViT-B/32 backbones" but does not specify the hardware (e.g., GPU/CPU models, memory) on which these models were run for experiments. |
| Software Dependencies | No | The paper mentions using "CLIP (Radford et al. 2021)" and "GPT-3.5 (Brown et al. 2020; Peng et al. 2023)" but does not specify version numbers for these or for any other software libraries or frameworks used in the implementation. |
| Experiment Setup | Yes | We use the default temperature T = 1000 and do not use perturbation for fair comparison... We study the effect of the number of concepts on the performance on AUROC and FPR@95TPR of DescCLIP + MSP, ODIN, DOCTOR and ORCA-R... For datasets with few categories... we use different prompts to retrieve diverse collections of concepts from the large language model GPT-3.5... and manually select the top 10 visual concepts... An example of our prompt is as follows... For datasets with a larger number of categories... we then select the top concepts that yield the highest average similarity score with the images within each category to form A. |
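
The concept-selection step quoted in the Experiment Setup row (keeping, for each category, the concepts with the highest average similarity to that category's images) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_top_concepts`, the input layout (one row of precomputed image-concept similarity scores per image, over a shared pool of candidate concepts), and the toy numbers are all assumptions made for illustration. In practice the scores would presumably come from CLIP image and text embeddings.

```python
from collections import defaultdict


def select_top_concepts(similarities, labels, k):
    """Pick, per category, the k concepts with the highest mean similarity.

    similarities[i][j] is an assumed precomputed similarity score between
    image i and candidate concept j; labels[i] is the category of image i.
    Returns a dict mapping each category to a list of k concept indices,
    ordered from highest to lowest mean similarity.
    """
    sums = {}                  # category -> per-concept running sum of scores
    counts = defaultdict(int)  # category -> number of images seen

    for row, label in zip(similarities, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        for j, score in enumerate(row):
            sums[label][j] += score
        counts[label] += 1

    selected = {}
    for label, totals in sums.items():
        means = [total / counts[label] for total in totals]
        # Rank concept indices by mean similarity, best first.
        ranked = sorted(range(len(means)), key=lambda j: means[j], reverse=True)
        selected[label] = ranked[:k]
    return selected


# Toy example: three images, three candidate concepts, two categories.
toy_similarities = [
    [0.9, 0.1, 0.5],  # image of category "cat"
    [0.7, 0.3, 0.5],  # image of category "cat"
    [0.1, 0.8, 0.6],  # image of category "dog"
]
toy_labels = ["cat", "cat", "dog"]
top_concepts = select_top_concepts(toy_similarities, toy_labels, k=2)
```

With these toy scores, the "cat" category keeps concepts 0 and 2 and the "dog" category keeps concepts 1 and 2. The value of `k` and the size of the candidate pool are free parameters here; the quoted text does not state them for the larger datasets.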