Detecting Systematic Weaknesses in Vision Models along Predefined Human-Understandable Dimensions

Authors: Sujan Sai Gannamaneni, Rohil Prakash Rao, Michael Mock, Maram Akila, Stefan Wrobel

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our algorithm on both synthetic and real-world datasets, demonstrating its ability to recover human-understandable systematic weaknesses. Furthermore, using our approach, we identify systematic weaknesses of multiple pre-trained and publicly available state-of-the-art computer vision DNNs.
Researcher Affiliation | Academia | Sujan Sai Gannamaneni (EMAIL), Fraunhofer IAIS, Lamarr Institute; Rohil Prakash Rao (EMAIL), Fraunhofer IAIS; Michael Mock (EMAIL), Fraunhofer IAIS; Maram Akila (EMAIL), Fraunhofer IAIS, Lamarr Institute; Stefan Wrobel (EMAIL), Fraunhofer IAIS, University of Bonn
Pseudocode | Yes | Algorithm 1: Systematic Weakness Detector (SWD)
Open Source Code | Yes | Our implementation is available at https://github.com/sujan-sai-g/Systematic-Weakness-Detection.
Open Datasets | Yes | Five pre-trained models, ViT-B-16 (Dosovitskiy et al., 2021), Faster R-CNN (Ren et al., 2015), SETR-PUP (Zheng et al., 2021), Panoptic FCN (Li et al., 2021), and YOLOv11m (Jocher & Qiu, 2024), are evaluated using five public datasets: CelebA (Liu et al., 2015), BDD100k (Yu et al., 2020), Cityscapes (Cordts et al., 2016), RailSem19 (Zendel et al., 2019), and EuroCity Persons (Braun et al., 2019), respectively.
Dataset Splits | Yes | We obtain an accuracy of 94.48% on the 202,599 images in the CelebA dataset. The models are evaluated on their respective datasets, i.e., BDD100k, Cityscapes, and RailSem19.
Hardware Specification | No | No specific hardware details (such as GPU models, CPU types, or cloud configurations) used for running the experiments are provided in the paper.
Software Dependencies | No | The paper refers to various models and methods such as CLIP, SliceLine, Faster R-CNN, and YOLOv11m, but does not explicitly list software dependencies with version numbers (e.g., Python, PyTorch, or specific library versions) used for the implementation.
Experiment Setup | Yes | We restrict the number of combinations (level) to 2 in this work. We used a cutoff for the slice error of 1.5 × ē_D (where ē_D denotes the global average error over the dataset) for all experiments except the Panoptic FCN model evaluation. In the Panoptic FCN evaluation, we use a cutoff of 1.0 × ē_D, as the global average error is already quite high.
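The slice-error cutoff quoted above can be illustrated with a short sketch. This is an assumed reading of the described setup, not the authors' implementation: a slice is flagged when its average error exceeds a multiple (1.5, or 1.0 for the Panoptic FCN evaluation) of the global average error; the slice names and error values below are hypothetical.

```python
# Sketch of the slice-error cutoff rule described in the experiment setup
# (assumed logic, not the authors' code): flag a slice as a systematic
# weakness when its error exceeds cutoff_factor times the global average
# error over the dataset.

def flag_weak_slices(slice_errors, global_avg_error, cutoff_factor=1.5):
    """Return the names of slices whose error exceeds the cutoff."""
    cutoff = cutoff_factor * global_avg_error
    return [name for name, err in slice_errors.items() if err > cutoff]

# Hypothetical level-2 slices (combinations of two metadata dimensions):
slices = {"night+rain": 0.30, "day+clear": 0.08, "night+clear": 0.16}
print(flag_weak_slices(slices, global_avg_error=0.10))
# → ['night+rain', 'night+clear']
```

Lowering `cutoff_factor` to 1.0, as done for the Panoptic FCN evaluation, flags any slice that merely exceeds the (already high) global average error.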