Evaluating Neuron Explanations: A Unified Framework with Sanity Checks
Authors: Tuomas Oikarinen, Ge Yan, Tsui-Wei Weng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform two versions of these tests, Experimental on real neurons across diverse settings, and Theoretical on ideal neurons described below. ... we perform an additional comparison between evaluation metrics by empirically comparing how well they perform on neurons where we know their ground truth function |
| Researcher Affiliation | Academia | 1CSE, UC San Diego, CA, USA 2HDSI, UC San Diego, CA, USA. Correspondence to: Tuomas Oikarinen <EMAIL>, Tsui-Wei Weng <EMAIL>. |
| Pseudocode | No | The paper describes various mathematical formulations and evaluation methods but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and results are publicly available at https://github.com/Trustworthy-ML-Lab/NeuronEval. |
| Open Datasets | Yes | We evaluated vision models across 3 datasets: ImageNet, Places365 and CUB200, while language models were evaluated on a subset of OpenWebText (Gokaslan et al., 2019). ... The ImageNet (Deng et al., 2009), Places (Zhou et al., 2017) and GPT-2 (Radford et al., 2019) models were pretrained. ... For CLIP, we used the pretrained model from (Radford et al., 2021), and then learned a linear probe on top of frozen image embeddings to minimize binary cross-entropy loss on the training split of CUB200 (Wah et al., 2011) |
| Dataset Splits | Yes | For all experiments we split a random 5% of the neurons into a validation set. For metrics that require hyperparameters such as α, we use the hyperparameters that performed the best in terms of Meta-AUPRC on the validation split for each setting. We then report performance on the remaining 95% of neurons. ... For CLIP, we used the pretrained model from (Radford et al., 2021), and then learned a linear probe on top of frozen image embeddings to minimize binary cross-entropy loss on the training split of CUB200 (Wah et al., 2011), with early stopping using validation data. |
| Hardware Specification | No | The paper discusses various models (ViT-B-16, ResNet-50, ResNet-18, GPT-2-small, GPT-2-XL) and datasets used in experiments but does not provide specific hardware details such as GPU or CPU models. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies used in the experiments. |
| Experiment Setup | Yes | For all experiments we split a random 5% of the neurons into a validation set. For metrics that require hyperparameters such as α, we use the hyperparameters that performed the best in terms of Meta-AUPRC on the validation split for each setting. We then report performance on the remaining 95% of neurons. For all evaluations we used neuron activations after the activation function (i.e. softmax/sigmoid). ... For layer4 (after avg pool) neurons we defined the correct concept t_k as the concept that maximizes IoU with α = 0.005 similar to (Bau et al., 2017), using the class (and superclass) labels of the dataset as c_t. For these layers we fixed α = 0.005 for all metrics as that was used to determine the ground truth. |
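The ground-truth assignment quoted above (matching each neuron to the concept that maximizes IoU under an α-quantile activation threshold, following Bau et al., 2017) can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the function name `iou_concept_match` and the flat per-input data layout are assumptions; the paper's repository (linked above) contains the actual implementation.

```python
import numpy as np

def iou_concept_match(activations, concept_masks, alpha=0.005):
    """Assign a neuron the concept with highest IoU against its
    binarized activation pattern (Network Dissection-style).

    activations: (n_inputs,) array of neuron activations, one per input.
    concept_masks: dict mapping concept name -> (n_inputs,) boolean
        array marking inputs where that concept is present.
    alpha: fraction of inputs counted as "neuron active" -- the
        activation threshold is the (1 - alpha) quantile, so only the
        top-alpha activations are considered firing.
    """
    # Binarize: the top-alpha fraction of activations count as active.
    thresh = np.quantile(activations, 1.0 - alpha)
    active = activations > thresh

    best_concept, best_iou = None, -1.0
    for name, mask in concept_masks.items():
        inter = np.logical_and(active, mask).sum()
        union = np.logical_or(active, mask).sum()
        iou = inter / union if union > 0 else 0.0
        if iou > best_iou:
            best_concept, best_iou = name, iou
    return best_concept, best_iou
```

With α = 0.005 (as fixed in the paper for these layers), a neuron over ~50k ImageNet-scale inputs is "active" on roughly its top 250 inputs; the concept whose label mask best overlaps those inputs becomes its ground-truth explanation t_k.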