For Robust Worst-Group Accuracy, Ignore Group Annotations
Authors: Nathan Stromberg, Rohan Ayyagari, Monica Welfert, Sanmi Koyejo, Richard Nock, Lalitha Sankar
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This is further confirmed with numerical experiments for a synthetic Gaussian mixture dataset modeling latent representations. We test RAD-UW on several large publicly available datasets and demonstrate that it achieves SOTA WGA even with noisy domain annotations. We present worst-group accuracies for several representative methods across four large publicly available datasets. |
| Researcher Affiliation | Collaboration | Nathan Stromberg (Arizona State University), Rohan Ayyagari (Arizona State University), Monica Welfert (Arizona State University), Sanmi Koyejo (Stanford University), Richard Nock (Google Research), Lalitha Sankar (Arizona State University) |
| Pseudocode | Yes | Pseudocode for this algorithm is presented in Algorithm 1. Algorithm 1 Regularized Annotation of Domains (RAD). Algorithm 2 RAD-UW. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it provide a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | We present worst-group accuracies for several representative methods across four large publicly available datasets. CMNIST (Arjovsky et al., 2019) is a variant of the MNIST handwritten digit dataset. CelebA (Liu et al., 2015) is a dataset of celebrity faces. Waterbirds (Sagawa et al., 2020) is a semi-synthetic dataset. MultiNLI (Williams et al., 2018) is a text corpus dataset. CivilComments (Borkan et al., 2019) is a text corpus dataset of public comments on news websites. |
| Dataset Splits | Yes | Following prior work (Kirichenko et al., 2023; LaBonte et al., 2023), we use half of the validation as retraining data, i.e., training data for only the last layer, and half as a clean holdout. Table 2: Dataset splits (provides quantitative splits for Train, Val, Test sets). |
| Hardware Specification | No | The paper does not mention any specific hardware (GPU, CPU models, memory details, or cloud resources) used for running the experiments. |
| Software Dependencies | Yes | We use the logistic regression implementation from the scikit-learn (Pedregosa et al., 2011) package for the retraining step for all presented methods. RAD-UW uses a regularized linear model implemented with PyTorch for the pseudo-annotation of domain labels. The pseudo-annotation model uses a weight decay of 1e-3 with the AdamW optimizer from PyTorch. CosineAnnealingLR learning rate scheduler from PyTorch. Linear learning rate scheduler imported from the transformers library. (No package versions are specified.) |
| Experiment Setup | Yes | For all methods, we tune the inverse of λ, where λ is the regularization strength, over 20 (equally-spaced on a log scale) values ranging from 1e-4 to 1. For all final retraining steps (including LLR) an ℓ1 regularization is added. We additionally tune the regularization strength λ of the retraining model along with the upweighting factor c. The pseudo-annotation model uses a weight decay of 1e-3 with the AdamW optimizer from PyTorch. For all datasets except Waterbirds, the pseudo-annotation model is trained for 6 epochs. Waterbirds is trained for 60 epochs. For all datasets except CMNIST, we tune the inverse of λID for the pseudo-annotation model over 20 (equally-spaced on a log scale) values ranging from 1e-7 to 1e-3. For CMNIST, we tune the inverse of λID over 20 (equally-spaced on a log scale) values ranging from 1e-1 to 1e2. For the retraining model, we tune the inverse of λ over 20 (equally-spaced on a log scale) values ranging from 1e-4 to 1. |
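The tuning grid quoted above (20 values of 1/λ, equally spaced on a log scale from 1e-4 to 1, with an ℓ1-penalized scikit-learn logistic regression for the retraining step) can be sketched as follows. This is an illustrative reconstruction, not the paper's released code: the synthetic data and the plain-accuracy selection criterion are placeholders standing in for the actual retraining sets and the worst-group-accuracy model selection the paper uses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 20 values of 1/lambda, equally spaced on a log scale from 1e-4 to 1,
# matching the retraining-step grid described in the setup.
inv_lambdas = np.logspace(-4, 0, num=20)

# Placeholder data; the paper retrains on held-out validation embeddings.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# scikit-learn parameterizes regularization by C = 1/lambda, so the grid
# is passed to C directly; penalty="l1" mirrors the l1-regularized
# final retraining step (liblinear supports the l1 penalty).
best = None
for C in inv_lambdas:
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X, y)
    score = clf.score(X, y)  # stand-in for worst-group accuracy on a holdout
    if best is None or score > best[1]:
        best = (C, score)
```

In the paper's pipeline the selection metric would be worst-group accuracy on the clean holdout half of the validation set, rather than the overall training accuracy used here for brevity.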