DCBM: Data-Efficient Visual Concept Bottleneck Models
Authors: Katharina Prasse, Patrick Knab, Sascha Marton, Christian Bartelt, Margret Keuper
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively evaluate the DCBM framework across concept proposal generation methods and datasets. To ensure its usefulness for real-life applications, we assess the transferability of DCBMs to out-of-domain (OOD) settings and verify the localization of the important concepts. We benchmark DCBM on standard CBM datasets and assess the generalization capabilities of its visual concepts on the out-of-distribution dataset ImageNet-R. Moreover, we evaluate it on two uncommon datasets for XAI, MiT-States and ClimateTV, and the less common AwA2 and CelebA datasets. |
| Researcher Affiliation | Academia | 1Data and Web Science Group, University of Mannheim, Mannheim, Germany 2Clausthal University of Technology, Clausthal, Germany 3Max-Planck-Institute for Informatics, Saarland Informatics Campus. Correspondence to: Katharina Prasse <EMAIL>, Patrick Knab <EMAIL>. |
| Pseudocode | Yes | We describe the DCBM framework in the main paper. For better understanding, we provide the framework with pseudocode in Algorithm 1 and introduce notation to this end. Algorithm 1 DCBM Framework |
| Open Source Code | Yes | The code is available at: https://github.com/KathPra/DCBM. |
| Open Datasets | Yes | We evaluate DCBMs on the five commonly used datasets in the CBM community. For general image classification, we employ CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009), as they offer a wide range of classes. For domain-specific tasks, we use CUB (Wah et al., 2011) and Places365 (Zhou et al., 2017), which provide targeted, domain-specific categories. Additionally, we ablate DCBMs on ImageNette (Howard, 2019a) and the fine-grained dog classification dataset ImageWoof (Howard, 2019b). Moreover, we evaluate on AwA2 (Xian et al., 2018), CelebA (Liu et al., 2015), and a subset of ImageNet (i.e. the first 200 classes) to compare against other CBMs and to exemplify its applicability to diverse datasets. We further evaluate it on two novel datasets for the XAI community, the social-media dataset ClimateTV (Prasse et al., 2023) and MiT-States (Isola et al., 2015), as inspired by (Yun et al., 2023). We also evaluate on ImageNet-R (Hendrycks et al., 2021) to assess CBM performance under complete domain shifts. See Appendix B for a detailed dataset overview. |
| Dataset Splits | Yes | For the datasets which do not have a test split, we use the validation split for testing and create a new validation split from 10% of the train set. Consequently, the train split comprises only 90% of the original train images (see Table 8). Table 8. Ablation datasets. Overview of the datasets used for ablation: ImageNette: 10 classes, 13,000 images (70/30/0 train/val/test); ImageWoof: 10 classes, 12,000 images (70/30/0); CUB-200-2011: 200 classes, 11,788 images (50/50/0). |
| Hardware Specification | Yes | Concept clustering and CBM training takes five minutes for small datasets and up to two hours for large datasets on a single RTX A6000 (more in Appendix E.1). |
| Software Dependencies | No | Our implementation is in Python and PyTorch, and our CBM implementation is based on (Rao et al., 2024; Yuksekgonul et al., 2023). For Grad-CAM (Selvaraju et al., 2019) calculations, we use the implementation by (Zakka, 2021), which we adjust to ViTs and to concept matching instead of text matching, based on Gildenblat et al.'s implementation (Gildenblat & contributors, 2021). |
| Experiment Setup | Yes | Hyperparameters. We ablate all hyperparameters on a held-out validation set. In case this does not exist, we construct it to comprise 10% randomly selected, class-balanced training samples. Using this setting, we ablate on the CUB, ImageNette, and ImageWoof datasets to find suitable hyperparameters. The search space comprises a learning rate lr ∈ {1e-4, 1e-3, 1e-2}, a sparsity parameter λ ∈ {1e-4, 1e-3, 1e-2}, the number of clusters k ∈ {128, 256, 512, 1024, 2048, 4096}, and the concept proposal models. For all datasets and concept proposal generation methods, lr = 1e-4 and sparsity parameter λ = 1e-4 were found to be optimal. The number of clusters k depends on the size of the concept proposal set S, but we observed that larger values of k tend to perform better, and stick to k = 2048 if not mentioned otherwise. We train each DCBM model for 200 epochs with a batch size of 512 (Places365 & ImageNet) or 32 (all remaining). |
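The reported hyperparameter search amounts to a small grid over learning rate and sparsity weight for a sparse linear classifier on top of concept features. A minimal sketch of that loop, using synthetic features in place of the paper's concept-similarity features (the data, helper names, and training loop here are illustrative assumptions, not taken from the DCBM code):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for concept features: n samples, k concept scores each,
# with labels produced by a hidden linear model so a linear probe can fit them.
k, n_classes, n = 64, 5, 500
X = rng.normal(size=(n, k))
W_true = rng.normal(size=(k, n_classes))
y = (X @ W_true).argmax(axis=1)

def train_sparse_linear(X, y, lr, lam, epochs=200):
    """Softmax regression with an L1 sparsity penalty, plain full-batch GD."""
    n, _ = X.shape
    W = np.zeros((X.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = X @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        # Cross-entropy gradient plus L1 subgradient (the sparsity term).
        grad = X.T @ (p - onehot) / n + lam * np.sign(W)
        W -= lr * grad
    acc = ((X @ W).argmax(axis=1) == y).mean()
    return W, acc

# Grid over the lr and λ values stated in the paper (k is fixed here).
best = max(
    ((lr, lam, train_sparse_linear(X, y, lr, lam)[1])
     for lr, lam in itertools.product([1e-4, 1e-3, 1e-2], repeat=2)),
    key=lambda t: t[2],
)
print(best)  # (best lr, best λ, training accuracy)
```

In practice the selection would use a held-out validation split (as the paper does) rather than training accuracy; the sketch keeps a single split only for brevity.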