REPEAT: Improving Uncertainty Estimation in Representation Learning Explainability
Authors: Kristoffer K. Wickstrøm, Thea Brüsch, Michael C. Kampffmeyer, Robert Jenssen
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive evaluation shows that REPEAT gives certainty estimates that are more intuitive, better at detecting out-of-distribution data, and more concise. Our contributions are: ... 2. Extensive evaluation across numerous feature extractors and datasets and comparison with state-of-the-art baselines. Results show that REPEAT produces more intuitive uncertainty estimates that are better at detecting out-of-distribution data and have lower complexity, compared to other state-of-the-art methods. 3. Evaluation on a downstream task where uncertainty is used to detect poisoned data in the unsupervised representation learning setting (He, Zha, and Katabi 2023). |
| Researcher Affiliation | Academia | 1Department of Physics and Technology, UiT The Arctic University of Norway 2Department of Applied Mathematics and Computer Science, Technical University of Denmark 3Norwegian Computing Center, Oslo, Norway 4Pioneer Centre for AI, University of Copenhagen, Denmark *Corresponding author: EMAIL |
| Pseudocode | No | The paper describes the methodology using prose, equations (Eq. 1-6), and an overview figure (Fig. 2), but it does not include a clearly labeled pseudocode block or algorithm. |
| Open Source Code | Yes | Code https://github.com/Wickstrom/REPEAT/ |
| Open Datasets | Yes | We use four widely used computer vision datasets; MS-COCO (Lin et al. 2014), Pascal-VOC (Everingham et al. 2009), EuroSAT (Helber et al. 2018), and Fashion-MNIST (Xiao, Rasul, and Vollgraf 2017). |
| Dataset Splits | No | In all experiments, we randomly sample 1000 images from the dataset used for evaluation. We found that this was enough samples to provide reliable estimates of performance while still being computationally tractable. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | For simplicity and reproducibility, we use the pretrained weights from PyTorch (Paszke et al. 2019) for supervised classification of ImageNet (Deng et al. 2009). |
| Experiment Setup | Yes | REPEAT design choices: In all presented results, we generate K=10 realizations of the Bernoulli RVs and use the mean to perform the thresholding. Both of these choices are determined by quantitative evaluation that is reported in App. B. As the base stochastic R-XAI method we use RELAX (Wickstrøm et al. 2023), due to its high performance in recent works. ... Specifically, we follow Wang et al. (Wang et al. 2019), where Dropout is applied to the input (Dropout probability of 0.5). Here, we create 10 Dropout-versions of each image and calculate importance using the baseline methods. Uncertainty is computed by taking the standard deviation across all 10 importance maps. ... In all experiments, we randomly sample 1000 images from the dataset used for evaluation. ... RELAX and REPEAT experiments were repeated 3 times. |
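The baseline uncertainty scheme quoted in the Experiment Setup row (input Dropout with probability 0.5, 10 perturbed copies per image, standard deviation across the resulting importance maps) can be sketched as follows. This is a minimal illustration, not the authors' code: `importance_fn` is a hypothetical stand-in for whichever attribution method is used as the baseline, and the lack of rescaling after masking is an assumption, since the quoted setup does not specify it.

```python
import numpy as np


def dropout_uncertainty(image, importance_fn, n_samples=10, p=0.5, seed=0):
    """Input-Dropout baseline uncertainty, as described in the setup:
    create `n_samples` Dropout-versions of the image, compute an
    importance map for each, and report the per-pixel mean (importance)
    and standard deviation (uncertainty) across the maps.

    `importance_fn` is a hypothetical attribution backend mapping an
    image to an importance map of the same spatial shape.
    """
    rng = np.random.default_rng(seed)
    maps = []
    for _ in range(n_samples):
        keep = rng.random(image.shape) >= p  # Bernoulli keep-mask, drop prob p
        maps.append(importance_fn(image * keep))  # no rescaling (assumption)
    maps = np.stack(maps)
    return maps.mean(axis=0), maps.std(axis=0)
```

With the identity function as a dummy attribution backend, the call returns one mean map and one uncertainty map, each with the same shape as the input image.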