Explaining Explainability: Recommendations for Effective Use of Concept Activation Vectors

Authors: Angus Nicolson, Lisa Schut, Alison Noble, Yarin Gal

TMLR 2025

Reproducibility assessment (variable, result, and supporting LLM response):

Research Type: Experimental
  "Our experiments are performed on natural images (ImageNet), skin lesions (ISIC 2019), and a new synthetic dataset, Elements. Elements is designed to capture a known ground-truth relationship between concepts and classes. We release this dataset to facilitate further research in understanding and evaluating interpretability methods."

Researcher Affiliation: Academia
  Angus Nicolson, Institute of Biomedical Engineering, University of Oxford; Lisa Schut, OATML, Department of Computer Science, University of Oxford; Alison J. Noble, Institute of Biomedical Engineering, University of Oxford; Yarin Gal, OATML, Department of Computer Science, University of Oxford

Pseudocode: No
  The paper describes its methods through mathematical equations and definitions but does not include any clearly labelled 'Pseudocode' or 'Algorithm' blocks.

Open Source Code: No
  The text states: "We release this dataset to facilitate further research in understanding and evaluating interpretability methods." This refers to the Elements dataset; it does not explicitly state that the source code for the paper's methodology is released, nor does it provide a link to a code repository.

Open Datasets: Yes
  "Our experiments are performed on natural images (ImageNet), skin lesions (ISIC 2019), and a new synthetic dataset, Elements. Elements is designed to capture a known ground-truth relationship between concepts and classes. We release this dataset to facilitate further research in understanding and evaluating interpretability methods." ImageNet (Deng et al., 2009); ISIC 2019 (Tschandl et al., 2018; Codella et al., 2017; Combalia et al., 2019)

Dataset Splits: No
  For the ISIC 2019 dataset, the paper mentions "training until convergence of validation loss to achieve an area under the receiver operating characteristic curve (AUC) of 0.91 on the validation split." For the Elements dataset, it mentions "giving a validation accuracy of 99.98% for the standard dataset." While these statements imply the use of validation splits, specific percentages or sample counts for the training, validation, and test sets are not provided.

Hardware Specification: No
  The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications.

Software Dependencies: No
  The paper mentions the "TorchVision package in PyTorch" and the "Adam optimiser (Kingma & Ba, 2015)" but does not specify version numbers for PyTorch or any other software components, which a reproducible description of ancillary software requires.

Experiment Setup: Yes
  "We train the model using Adam (Kingma & Ba, 2015) with a learning rate of 1e-3 until the training accuracy is greater than 99.99%, giving a validation accuracy of 99.98% for the standard dataset. We finetuned a ViT-B16 model pretrained on ImageNet for 50 epochs on the spatially dependent version of Elements (i.e. there are some classes which depend on the location of the objects as well as which concepts are present). We used an exponentially decaying learning rate with an initial learning rate of 0.0001 and a γ of 0.95."
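The quoted ViT finetuning settings (initial learning rate 0.0001, exponential decay with γ = 0.95, 50 epochs) can be sketched with standard PyTorch components. This is a minimal illustration, not the authors' released code: the `nn.Linear` model is a placeholder standing in for the pretrained ViT-B16, and the training pass itself is elided.

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import ExponentialLR

# Placeholder model; the paper finetunes a ViT-B16 pretrained on ImageNet
# (e.g. torchvision.models.vit_b_16 with ImageNet weights).
model = nn.Linear(16, 3)

# Settings quoted in the paper: initial lr 0.0001, exponential decay, gamma = 0.95.
optimizer = optim.Adam(model.parameters(), lr=1e-4)
scheduler = ExponentialLR(optimizer, gamma=0.95)

for epoch in range(50):  # "finetuned ... for 50 epochs"
    # ... one training pass over the spatially dependent Elements dataset ...
    scheduler.step()  # lr after epoch e is 1e-4 * 0.95**(e + 1)

final_lr = optimizer.param_groups[0]["lr"]
```

With γ = 0.95 the learning rate falls to roughly 7.7% of its initial value by epoch 50, a gentle decay consistent with finetuning a pretrained backbone.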