Finding and Fixing Spurious Patterns with Explanations

Authors: Gregory Plumb, Marco Tulio Ribeiro, Ameet Talwalkar

TMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that our method identifies a diverse set of spurious patterns (SPs) and mitigates them by producing a model that is both more accurate on a distribution where the spurious pattern is not helpful and more robust to distribution shift. We divide our experiments into three groups: in Section 5.1, we induce SPs of varying strength by sub-sampling COCO in order to understand how mitigation methods work in a controlled setting, and we show that SPIRE is more effective at mitigating these SPs than prior methods.
Researcher Affiliation | Collaboration | Gregory Plumb (CMU), Marco Tulio Ribeiro (Microsoft Research), Ameet Talwalkar (CMU). Carnegie Mellon University is an academic institution and Microsoft Research is an industry research lab, indicating an academia-industry collaboration.
Pseudocode | Yes | Algorithm 1 details the process that we use for adding or removing Spurious from a model's representation (found in Appendix H.1, SPIRE-R Projection Pseudocode).
Open Source Code | Yes | Code is available at https://github.com/GDPlumb/SPIRE
Open Datasets | Yes | A model trained to detect tennis rackets on the COCO dataset (Lin et al., 2014); UnRel (Peyre et al., 2017) and SpatialSense (Yang et al., 2019); the ISIC dataset (Codella et al., 2019).
Dataset Splits | Yes | Because the test set for this dataset is not publicly available, we used its validation set as our test set and divided its training set into 90-10 training and validation splits. We created a series of controlled training sets of size 2,000 by sampling images from the full training set such that P(Main) = P(Spurious) = 0.5 and p = P(Main | Spurious) ranges between 0.025 and 0.975.
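The controlled sampling described in this row can be sketched as follows. The function name `controlled_sample` and the pre-computed per-cell id lists are illustrative assumptions; the paper's released code operates on the actual COCO index. The only substance taken from the paper is the cell arithmetic: with P(Main) = P(Spurious) = 0.5 and p = P(Main | Spurious), the joint probabilities are P(Main & Spurious) = p/2 and P(Main & not Spurious) = 0.5 - p/2, with the negative cells mirroring them by symmetry.

```python
import random

def controlled_sample(both, main_only, spurious_only, neither, n=2000, p=0.5, seed=0):
    """Sample a training set of size n with P(Main) = P(Spurious) = 0.5 and
    P(Main | Spurious) = p, mirroring the paper's controlled sub-sampling of COCO.
    Each argument is a list of image ids for one co-occurrence cell
    (hypothetical pre-computed index lists)."""
    rng = random.Random(seed)
    n_both = round(n * p / 2)              # P(Main & Spurious) = p * P(Spurious)
    n_main_only = round(n * (0.5 - p / 2)) # P(Main & not Spurious)
    n_spurious_only = n_main_only          # symmetry keeps P(Main) = P(Spurious) = 0.5
    n_neither = n - n_both - n_main_only - n_spurious_only
    return (rng.sample(both, n_both)
            + rng.sample(main_only, n_main_only)
            + rng.sample(spurious_only, n_spurious_only)
            + rng.sample(neither, n_neither))
```

Sweeping p from 0.025 to 0.975 with this sampler yields the spectrum of spurious-pattern strengths, from strongly anti-correlated to strongly correlated, that the experiments evaluate.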
Hardware Specification | No | The paper does not provide specific hardware details such as the GPU/CPU models or processor types used for its experiments. It only mentions the model architecture (ResNet18) and software framework (PyTorch).
Software Dependencies | No | All of our experiments started with the pretrained ResNet18 (He et al., 2016) that is available from PyTorch (Paszke et al., 2019). We minimized the binary cross-entropy loss using Adam (Kingma & Ba, 2014). The paper names PyTorch and Adam but does not provide version numbers for these software dependencies.
Experiment Setup | Yes | We minimized the binary cross-entropy loss using Adam (Kingma & Ba, 2014) with a batch size of 64. For transfer-learning we used a learning rate of 0.001 and for fine-tuning a learning rate of 0.0001; we explored other options during early experiments but found no benefit to doing so. If the training loss failed to decrease sufficiently after a set number of epochs, we lowered the learning rate. For SPIRE, we considered removing objects both by covering them with a grey box and by in-painting them; we found that transfer-learning while covering objects with a grey box was the most effective (see Table 9). RRR, CDEP, and GS all have regularization weights that can be tuned, and FS has a tunable minimum weight for images of objects out of context. For these methods, we considered values that are powers of 10 ranging from 0.1 to 10,000; no method chose either extreme value.
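The grey-box object removal mentioned in this row can be sketched as below. This is a minimal illustration under stated assumptions: the function name `cover_with_grey_box`, the nested-list RGB image representation, the rectangular box, and the grey value (128, 128, 128) are all hypothetical; the released SPIRE code works on real COCO images and their segmentation masks rather than axis-aligned rectangles.

```python
def cover_with_grey_box(image, box, grey=(128, 128, 128)):
    """Replace the pixels inside `box` with a flat grey patch, sketching the
    object-removal variant the paper found most effective.

    `image` is an H x W list of RGB tuples and `box` is (x0, y0, x1, y1)
    in pixel coordinates with exclusive upper bounds."""
    x0, y0, x1, y1 = box
    for y in range(y0, y1):
        for x in range(x0, x1):
            image[y][x] = grey
    return image
```

Applying this to the spurious object in each image, then transfer-learning on the edited images, corresponds to the counterfactual-augmentation step that SPIRE uses for mitigation.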