ShortcutProbe: Probing Prediction Shortcuts for Learning Robust Models

Authors: Guangtao Zheng, Wenqian Ye, Aidong Zhang

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We theoretically analyze the effectiveness of the framework and empirically demonstrate that it is an efficient and practical tool for improving a model's robustness to spurious bias on diverse datasets. Through extensive experiments, we show that our method successfully trains models robust to spurious biases without prior knowledge about these biases. Section 5: Experiments, Section 5.1: Datasets, Section 5.2: Experimental Setup, Section 5.3: Analysis of Probe Set, Section 5.4: Main Results (Tables 1, 2, 3), Section 5.5: Ablation Studies (Figure 3).
Researcher Affiliation | Academia | Guangtao Zheng, Wenqian Ye, and Aidong Zhang, University of Virginia. EMAIL
Pseudocode | No | The paper describes the methodology using prose and mathematical equations. It states that "Details of the training algorithm are provided in Appendix," but the appendix content is not included in the analyzed text, so no structured pseudocode or algorithm blocks are present in the provided paper text.
Open Source Code | Yes | Code is available at https://github.com/gtzheng/ShortcutProbe.
Open Datasets | Yes | Waterbirds [Sagawa et al., 2019], CelebA [Liu et al., 2015], CheXpert [Irvin et al., 2019], ImageNet-9 [Ilyas et al., 2019] (a subset of ImageNet [Deng et al., 2009]), ImageNet-A [Hendrycks et al., 2021], NICO [He et al., 2021], MultiNLI [Williams et al., 2017], CivilComments [Borkan et al., 2019].
Dataset Splits | Yes | From the chosen data source, such as the training or validation set, we sorted the samples within each class by their prediction losses and divided them into two equal halves: a high-loss set and a low-loss set. ... Then, we retrained the model on half of the validation set using various bias mitigation methods. For our method, we first constructed the probe set using the same half of the validation set and used the probe set for shortcut detection and mitigation. The remaining half of the validation set was used for model selection and hyperparameter tuning. ... We prepared the training and validation data as in [Kim et al., 2022] and [Bahng et al., 2020].
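The per-class split quoted above (sort samples within each class by prediction loss, then divide into equal high-loss and low-loss halves) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name `split_by_loss` and the `(sample_id, class_label, loss)` input format are assumptions.

```python
from collections import defaultdict


def split_by_loss(samples):
    """Split samples within each class into high-loss and low-loss halves.

    samples: iterable of (sample_id, class_label, loss) tuples.
    Returns two dicts mapping class_label -> list of sample_ids.
    """
    per_class = defaultdict(list)
    for sample_id, label, loss in samples:
        per_class[label].append((sample_id, loss))

    high_loss, low_loss = {}, {}
    for label, items in per_class.items():
        items.sort(key=lambda pair: pair[1], reverse=True)  # highest loss first
        mid = len(items) // 2  # equal halves per class
        high_loss[label] = [sid for sid, _ in items[:mid]]
        low_loss[label] = [sid for sid, _ in items[mid:]]
    return high_loss, low_loss
```

For instance, four samples of one class with losses 0.9, 0.1, 0.5, 0.2 would be split into a high-loss half {0.9, 0.5} and a low-loss half {0.2, 0.1}.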
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU or CPU models, memory specifications, or detailed computing environments used for the experiments.
Software Dependencies | No | The paper mentions using "ResNet-50 as the backbone network", "ResNet-18", and a "pretrained BERT model [Kenton and Toutanova, 2019]" but does not provide specific version numbers for these or any other core software libraries/frameworks.
Experiment Setup | Yes | We first trained a base model initialized with pretrained weights using empirical risk minimization (ERM) on the training dataset. Then, we retrained the model on half of the validation set... The remaining half of the validation set was used for model selection and hyperparameter tuning. ... ψ* = arg min_ψ L_det + η L_reg, where η > 0 represents the regularization strength. ... θ₂* = arg min_{θ₂} L_ori + λ L_spu, where λ > 0 is the regularization strength. ... We retrain only the final classification layer of the model while keeping the feature extractor frozen.
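The last step quoted above, retraining only the final classification layer while the feature extractor stays frozen, can be sketched in plain Python as fitting a softmax classifier on precomputed (frozen) features by gradient descent. This is a minimal illustration under that reading, not the authors' implementation; `retrain_last_layer` and its hyperparameters are assumptions.

```python
import math


def retrain_last_layer(features, labels, n_classes, lr=0.5, steps=300):
    """Fit a linear softmax classifier W (d x n_classes) on frozen features.

    features: list of feature vectors produced by the frozen extractor.
    labels: list of integer class labels.
    Only W is updated; the features never change, mirroring a frozen backbone.
    """
    d = len(features[0])
    W = [[0.0] * n_classes for _ in range(d)]
    n = len(features)
    for _ in range(steps):
        grad = [[0.0] * n_classes for _ in range(d)]
        for x, y in zip(features, labels):
            logits = [sum(x[i] * W[i][c] for i in range(d)) for c in range(n_classes)]
            m = max(logits)
            exps = [math.exp(z - m) for z in logits]  # numerically stable softmax
            total = sum(exps)
            for c in range(n_classes):
                err = exps[c] / total - (1.0 if c == y else 0.0)  # dL/dlogit_c
                for i in range(d):
                    grad[i][c] += x[i] * err
        for i in range(d):
            for c in range(n_classes):
                W[i][c] -= lr * grad[i][c] / n  # update classifier weights only
    return W
```

In a deep-learning framework the same effect is typically achieved by setting the backbone parameters to non-trainable and passing only the final layer's parameters to the optimizer.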