Detecting Systematic Weaknesses in Vision Models along Predefined Human-Understandable Dimensions
Authors: Sujan Sai Gannamaneni, Rohil Prakash Rao, Michael Mock, Maram Akila, Stefan Wrobel
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our algorithm on both synthetic and real-world datasets, demonstrating its ability to recover human-understandable systematic weaknesses. Furthermore, using our approach, we identify systematic weaknesses of multiple pre-trained and publicly available state-of-the-art computer vision DNNs. |
| Researcher Affiliation | Academia | Sujan Sai Gannamaneni, EMAIL, Fraunhofer IAIS, Lamarr Institute; Rohil Prakash Rao, EMAIL, Fraunhofer IAIS; Michael Mock, EMAIL, Fraunhofer IAIS; Maram Akila, EMAIL, Fraunhofer IAIS, Lamarr Institute; Stefan Wrobel, EMAIL, Fraunhofer IAIS, University of Bonn |
| Pseudocode | Yes | Algorithm 1: Systematic Weakness Detector (SWD) |
| Open Source Code | Yes | Our implementation is available at https://github.com/sujan-sai-g/Systematic-Weakness-Detection. |
| Open Datasets | Yes | Five pre-trained models, ViT-B-16 (Dosovitskiy et al., 2021), Faster R-CNN (Ren et al., 2015), SETR PUP (Zheng et al., 2021), Panoptic FCN (Li et al., 2021), and YOLOv11m (Jocher & Qiu, 2024), are evaluated using five public datasets: CelebA (Liu et al., 2015), BDD100k (Yu et al., 2020), Cityscapes (Cordts et al., 2016), RailSem19 (Zendel et al., 2019), and EuroCity Persons (Braun et al., 2019), respectively. |
| Dataset Splits | Yes | We obtain an accuracy of 94.48% on the 202,599 images in the CelebA dataset. The models are evaluated on their respective datasets, i.e., BDD100k, Cityscapes, and RailSem19. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or cloud configurations) used for running experiments are provided in the paper. |
| Software Dependencies | No | The paper refers to various models and methods like CLIP, SliceLine, Faster R-CNN, and YOLOv11m, but does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch versions, specific library versions) used for their implementation. |
| Experiment Setup | Yes | We restrict the number of combinations (level) to 2 in this work. We used a slice-error cutoff of 1.5·e\|D (1.5 times the global average error e\|D) for all experiments except the Panoptic FCN model evaluation. In the Panoptic FCN evaluation, we utilize a cutoff of 1.0·e\|D as the global average error is already quite high. |
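The experiment-setup row describes the core selection rule: enumerate attribute combinations up to level 2 and flag any slice whose mean error exceeds a multiple (1.5, or 1.0 for Panoptic FCN) of the global average error e|D. The sketch below illustrates that rule only; the function name, the data layout, and the sample attributes are illustrative assumptions, not the authors' SWD implementation.

```python
# Illustrative sketch of the slice-error cutoff rule described in the paper's
# experiment setup (not the authors' code). A "slice" is a group of samples
# sharing values on a level-sized combination of metadata attributes; it is
# flagged weak when its mean error exceeds cutoff * global mean error (e|D).
from itertools import combinations
from statistics import mean

def find_weak_slices(samples, attributes, level=2, cutoff=1.5):
    """samples: list of dicts with attribute values and an 'error' field."""
    global_error = mean(s["error"] for s in samples)  # e|D
    weak = []
    for attrs in combinations(attributes, level):  # level-2 combinations
        # Group samples by their value assignment on the chosen attributes.
        groups = {}
        for s in samples:
            key = tuple(s[a] for a in attrs)
            groups.setdefault(key, []).append(s["error"])
        for key, errs in groups.items():
            slice_error = mean(errs)
            if slice_error > cutoff * global_error:
                weak.append((dict(zip(attrs, key)), slice_error))
    return weak

# Toy data with hypothetical attributes: only the night+rain slice is weak.
samples = [
    {"lighting": "day", "weather": "clear", "error": 0.05},
    {"lighting": "day", "weather": "rain", "error": 0.07},
    {"lighting": "night", "weather": "clear", "error": 0.08},
    {"lighting": "night", "weather": "rain", "error": 0.30},
]
print(find_weak_slices(samples, ["lighting", "weather"]))
# → [({'lighting': 'night', 'weather': 'rain'}, 0.3)]
```

Lowering `cutoff` to 1.0, as done for the Panoptic FCN evaluation, simply flags every slice whose error exceeds the global average.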