For Robust Worst-Group Accuracy, Ignore Group Annotations
Authors: Nathan Stromberg, Rohan Ayyagari, Monica Welfert, Sanmi Koyejo, Richard Nock, Lalitha Sankar
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This is further confirmed with numerical experiments for a synthetic Gaussian mixture dataset modeling latent representations. We test RAD-UW on several large publicly available datasets and demonstrate that it achieves SOTA WGA even with noisy domain annotations. We present worst-group accuracies for several representative methods across four large publicly available datasets. |
| Researcher Affiliation | Collaboration | Nathan Stromberg (Arizona State University), Rohan Ayyagari (Arizona State University), Monica Welfert (Arizona State University), Sanmi Koyejo (Stanford University), Richard Nock (Google Research), Lalitha Sankar (Arizona State University) |
| Pseudocode | Yes | Pseudocode for this algorithm is presented in Algorithm 1. Algorithm 1 Regularized Annotation of Domains (RAD). Algorithm 2 RAD-UW. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it provide a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | We present worst-group accuracies for several representative methods across four large publicly available datasets. CMNIST (Arjovsky et al., 2019) is a variant of the MNIST handwritten digit dataset. CelebA (Liu et al., 2015) is a dataset of celebrity faces. Waterbirds (Sagawa et al., 2020) is a semi-synthetic dataset. MultiNLI (Williams et al., 2018) is a text corpus dataset. CivilComments (Borkan et al., 2019) is a text corpus dataset of public comments on news websites. |
| Dataset Splits | Yes | Following prior work (Kirichenko et al., 2023; LaBonte et al., 2023), we use half of the validation as retraining data, i.e., training data for only the last layer, and half as a clean holdout. Table 2: Dataset splits (provides quantitative splits for Train, Val, Test sets). |
| Hardware Specification | No | The paper does not mention any specific hardware (GPU, CPU models, memory details, or cloud resources) used for running the experiments. |
| Software Dependencies | Yes | We use the logistic regression implementation from the scikit-learn (Pedregosa et al., 2011) package for the retraining step for all presented methods. RAD-UW uses a regularized linear model implemented with PyTorch for the pseudo-annotation of domain labels. The pseudo-annotation model uses a weight decay of 1e-3 with the AdamW optimizer from PyTorch. CosineAnnealingLR learning rate scheduler from PyTorch. Linear learning rate scheduler imported from the transformers library. (No package versions are specified.) |
| Experiment Setup | Yes | For all methods, we tune the inverse of λ, where λ is the regularization strength, over 20 (equally-spaced on a log scale) values ranging from 1e-4 to 1. For all final retraining steps (including LLR) an ℓ1 regularization is added. We additionally tune the regularization strength λ of the retraining model along with the upweighting factor c. The pseudo-annotation model uses a weight decay of 1e-3 with the AdamW optimizer from PyTorch. For all datasets except Waterbirds, the pseudo-annotation model is trained for 6 epochs. Waterbirds is trained for 60 epochs. For all datasets except CMNIST, we tune the inverse of λID for the pseudo-annotation model over 20 (equally-spaced on a log scale) values ranging from 1e-7 to 1e-3. For CMNIST, we tune the inverse of λID over 20 (equally-spaced on a log scale) values ranging from 1e-1 to 1e2. For the retraining model, we tune the inverse of λ over 20 (equally-spaced on a log scale) values ranging from 1e-4 to 1. |
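The tuning grid quoted above (20 values of 1/λ, equally spaced on a log scale from 1e-4 to 1, with an ℓ1-penalized scikit-learn logistic regression for the retraining step) can be sketched as follows. This is an illustrative reconstruction, not the paper's released code: the synthetic data and the plain-accuracy selection criterion are placeholders standing in for the actual retraining sets and the worst-group-accuracy model selection the paper uses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 20 values of 1/lambda, equally spaced on a log scale from 1e-4 to 1,
# matching the retraining-step grid described in the setup.
inv_lambdas = np.logspace(-4, 0, num=20)

# Placeholder data; the paper retrains on held-out validation embeddings.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# scikit-learn parameterizes regularization by C = 1/lambda, so the grid
# is passed to C directly; penalty="l1" mirrors the l1-regularized
# final retraining step (liblinear supports the l1 penalty).
best = None
for C in inv_lambdas:
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X, y)
    score = clf.score(X, y)  # stand-in for worst-group accuracy on a holdout
    if best is None or score > best[1]:
        best = (C, score)
```

In the paper's pipeline the selection metric would be worst-group accuracy on the clean holdout half of the validation set, rather than the overall training accuracy used here for brevity.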