Model Guidance via Robust Feature Attribution

Authors: Mihnea Ghitu, Vihari Piratla, Matthew Robert Wicker

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Across a comprehensive series of experiments, we show that our approach consistently reduces test-time misclassifications by 20% compared to state-of-the-art methods. We also extend prior experimental settings to include natural language processing tasks. Additionally, we conduct novel ablations that yield practical insights, including the relative importance of annotation quality over quantity."
Researcher Affiliation | Academia | Mihnea Ghitu (EMAIL), Imperial College London; Vihari Piratla (EMAIL), University of Cambridge; Matthew Wicker (EMAIL), Imperial College London
Pseudocode | No | The paper gives mathematical formulations and high-level descriptions of Rand-R4, Adv-R4, and Cert-R4, but it does not include a structured pseudocode block or algorithm environment.
Open Source Code | Yes | "Code for our method and experiments is available at: https://github.com/Mihneaghitu/Model-Guidance-Via-Robust-Feature-Attribution"
Open Datasets | Yes | "We conduct experiments on six datasets... The datasets include three synthetic ones and three real-world datasets: Decoy MNIST (Ross et al., 2017), Decoy DERM (a variant of Derm MNIST (Yang et al., 2023) created similarly to Decoy MNIST), Decoy IMDB (a text dataset (Maas et al., 2011) created by mimicking Decoy MNIST in a discrete space), ISIC (Codella et al., 2019), Plant Phenotyping, and Salient ImageNet (Singla et al., 2022)."
Dataset Splits | Yes | "The IMDB dataset (Maas et al., 2011) is made up of 50000 IMDB movie text reviews, half of which (25000) are in the train set and the other half (25000) in the test set, and induces a binary classification sentiment analysis task."
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies | No | The paper implies a Python/PyTorch implementation (model architectures use `torch.nn.ReLU()`), but it does not give version numbers for Python, PyTorch, or any other library or software dependency.
Experiment Setup | No | The paper mentions the Adam optimizer and discusses hyperparameters such as \lambda and \beta in the loss function, as well as varying weight-decay coefficients. However, it does not report concrete values for critical setup details such as learning rates, batch sizes, number of training epochs, or the exact \lambda, \beta, and weight-decay values used for the main results.
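Since the paper reports no pseudocode for Rand-R4, Adv-R4, or Cert-R4, it may help to sketch the general model-guidance idea they build on: penalising feature attributions on input regions annotated as irrelevant, in the style of Ross et al. (2017), which the paper cites for Decoy MNIST. For a linear model the input gradient equals the weight vector, so the attribution penalty reduces to a masked L2 penalty on the weights. Everything below (function names, hyperparameters, toy data) is an illustrative assumption, not taken from the paper.

```python
import numpy as np

def train_guided_logreg(X, y, mask, lam=10.0, lr=0.1, epochs=500):
    """Logistic regression with a 'right for the right reasons'-style
    attribution penalty (after Ross et al., 2017). For a linear model the
    input gradient is the weight vector, so penalising attributions on
    features flagged by `mask` is an L2 penalty on those weights.
    Hyperparameters are illustrative, not the paper's."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))      # sigmoid predictions
        g_w = X.T @ (p - y) / n + lam * (mask * w)  # data grad + guidance penalty
        g_b = np.mean(p - y)
        w -= lr * g_w
        b -= lr * g_b
    return w, b

# Toy data: feature 0 is a spurious 'decoy' perfectly predictive of y;
# feature 1 carries the genuine (noisy) signal.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200).astype(float)
X = np.stack([2 * y - 1, (2 * y - 1) + 0.5 * rng.standard_normal(200)], axis=1)
mask = np.array([1.0, 0.0])  # annotate the decoy feature as irrelevant
w, b = train_guided_logreg(X, y, mask)
```

With the penalty active, the learned weight on the decoy feature is driven toward zero while the genuine feature's weight grows, even though the decoy is the easier predictor; the paper's R4 variants differ in how robustly this attribution constraint is enforced.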
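The decoy datasets listed above follow a common recipe from Ross et al. (2017): inject a small confounding swatch whose appearance tracks the label at training time but is randomised at test time. A minimal sketch of that construction is below; the exact swatch size, placement, and intensity rule are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def add_decoy(images, labels, train=True, swatch=4, rng=None):
    """Sketch of a Decoy MNIST-style confounder (after Ross et al., 2017):
    a small corner swatch whose grey level tracks the label during training
    but is random at test time. Swatch size/placement/intensity here are
    illustrative assumptions."""
    rng = rng or np.random.default_rng(0)
    out = images.copy()
    n = len(images)
    # Grey level: label-dependent in train, randomised in test.
    shade = labels if train else rng.integers(0, 10, n)
    # Place the swatch in one of the four corners per image.
    corners = rng.integers(0, 4, n)
    for i in range(n):
        v = 255 - 25 * shade[i]
        r = 0 if corners[i] < 2 else images.shape[1] - swatch
        c = 0 if corners[i] % 2 == 0 else images.shape[2] - swatch
        out[i, r:r + swatch, c:c + swatch] = v
    return out
```

A model trained naively on such data can latch onto the swatch and fail once its correlation with the label is broken at test time, which is exactly the failure mode that attribution-guidance methods are evaluated against.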