Group-robust Sample Reweighting for Subpopulation Shifts via Influence Functions

Authors: Rui Qiao, Zhaoxuan Wu, Jingtan Wang, Pang Wei Koh, Bryan Kian Hsiang Low

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 EXPERIMENTS. Datasets. We evaluate the effectiveness of algorithms on 4 commonly used datasets. Waterbirds (Wah et al., 2011) is a binary object recognition dataset for bird types (i.e., waterbird, landbird), which are spuriously correlated with the background (i.e., water, land). CelebA (Liu et al., 2015) is a binary object recognition dataset for hair color blondness prediction. MultiNLI (Williams et al., 2017) is a multi-class natural language inference dataset. CivilComments (WILDS) (Borkan et al., 2019; Koh et al., 2021) is a binary text toxicity detection dataset. Table 1: Performance comparison. We report the worst-group accuracy on four benchmark datasets.
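The evidence above reports worst-group accuracy, the standard metric for subpopulation shift. As a minimal sketch (not the authors' code), it is simply the minimum per-group accuracy over all groups:

```python
import numpy as np

def worst_group_accuracy(preds, labels, groups):
    """Minimum per-group accuracy over all groups present in the data."""
    accs = []
    for g in np.unique(groups):
        mask = groups == g
        accs.append((preds[mask] == labels[mask]).mean())
    return min(accs)

# Toy example with 2 groups; group 1 is the worst-performing one.
preds  = np.array([1, 1, 0, 0, 1, 0])
labels = np.array([1, 1, 0, 1, 0, 0])
groups = np.array([0, 0, 0, 1, 1, 1])
wg_acc = worst_group_accuracy(preds, labels, groups)  # group 0: 3/3, group 1: 1/3
```

Unlike average accuracy, this metric is dominated by the hardest subpopulation (e.g., waterbirds on land backgrounds in Waterbirds), which is why it is the selection criterion throughout the paper.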
Researcher Affiliation | Academia | Rui Qiao1,2, Zhaoxuan Wu2, Jingtan Wang1,3, Pang Wei Koh4, Bryan Kian Hsiang Low1,2. 1 National University of Singapore; 2 Singapore-MIT Alliance for Research and Technology; 3 Agency for Science, Technology and Research (A*STAR); 4 University of Washington.
Pseudocode | Yes | Algorithm 1: Group-robust Sample Reweighting with last-layer retraining (GSR).
Open Source Code | Yes | Our code is available at https://github.com/qiaoruiyt/GSR.
Open Datasets | Yes | Datasets. We evaluate the effectiveness of algorithms on 4 commonly used datasets. Waterbirds (Wah et al., 2011) is a binary object recognition dataset... CelebA (Liu et al., 2015) is a binary object recognition dataset... MultiNLI (Williams et al., 2017) is a multi-class natural language inference dataset... CivilComments (WILDS) (Borkan et al., 2019; Koh et al., 2021) is a binary text toxicity detection dataset.
Dataset Splits | Yes | Overall, we follow the standard train, validation, and test splits for all datasets. We randomly create the target and validation sets by equally splitting the original validation set. We take out a random subset (e.g., 10%) of the training set as a held-out set D_tr-h.
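The split protocol quoted here can be sketched with stdlib tools; `split_indices` and the set sizes below are illustrative assumptions, not the authors' implementation:

```python
import random

def split_indices(n, frac, seed=0):
    """Randomly split range(n) into two disjoint index lists,
    the first holding ~frac of the points."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    k = int(frac * n)
    return idx[:k], idx[k:]

# Equally split the original validation set into target and validation halves.
target_idx, val_idx = split_indices(n=1000, frac=0.5, seed=0)

# Hold out a random 10% of the training set as D_tr-h.
heldout_idx, train_idx = split_indices(n=8000, frac=0.1, seed=1)
```

The key property is disjointness: the target set used to guide reweighting never overlaps the validation set used for model selection.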
Hardware Specification | Yes | For all experiments, we run on machines with an NVIDIA RTX 3080 (10GB) / RTX A5000 (24GB) GPU and an AMD EPYC 7543 CPU.
Software Dependencies | No | As we use PyTorch for our implementation and last-layer retraining has very few parameters, the per-sample gradient and Hessian inverse calculations involve high CPU usage, even though the tensors are on the GPU. For model architectures and initialization, we use an ImageNet-pretrained ResNet-50 (V1) (He et al., 2016) for the Waterbirds and CelebA datasets, and BERT (Hugging Face) for MultiNLI and CivilComments.
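The per-sample gradients and Hessian inverse mentioned here are the building blocks of influence functions (Koh & Liang, 2017) applied to a linear last layer. A minimal NumPy sketch for logistic regression (not the paper's PyTorch code; the explicit matrix inverse is tractable only because the last layer is small):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_sample_grads(w, X, y):
    """Per-sample gradients of the logistic loss w.r.t. weights w: shape (n, d)."""
    p = sigmoid(X @ w)
    return (p - y)[:, None] * X

def hessian(w, X, y, lam=1e-3):
    """Average Hessian of the L2-regularized logistic loss: (1/n) X^T D X + lam*I."""
    p = sigmoid(X @ w)
    d = p * (1.0 - p)
    return (X * d[:, None]).T @ X / len(y) + lam * np.eye(X.shape[1])

def influence_on_target(w, X_tr, y_tr, x_t, y_t):
    """Influence of each training point on a target point's loss: -g_t^T H^{-1} g_i."""
    H_inv = np.linalg.inv(hessian(w, X_tr, y_tr))
    g_t = per_sample_grads(w, x_t[None, :], np.array([y_t]))[0]
    G = per_sample_grads(w, X_tr, y_tr)
    return -G @ (H_inv @ g_t)

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(50, 4))
y_tr = (X_tr[:, 0] > 0).astype(float)
w = np.zeros(4)
scores = influence_on_target(w, X_tr, y_tr, X_tr[0], y_tr[0])
```

Materializing the (n, d) per-sample gradient matrix and inverting the (d, d) Hessian are exactly the dense linear-algebra steps that push work onto the CPU even when the tensors live on the GPU.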
Experiment Setup | Yes | For stage 1, we use almost the same hyperparameters as DFR (some are slightly altered for consistency and simplicity by heuristics, without any tuning); they are recorded in Table 2. For stage 2, there are two sets of hyperparameters, for the inner and outer loops. We perform a non-exhaustive search by randomly sampling hyperparameter combinations from a grid, and select the best-performing model checkpoints and configurations based on worst-group validation accuracy. The selected configurations are documented in Tables 3 and 4 for the outer and inner loops respectively.
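The non-exhaustive search described above can be sketched as uniform sampling from a grid with selection by worst-group validation accuracy; the grid values and `fake_evaluate` below are illustrative stand-ins, not the values from Tables 3 and 4:

```python
import itertools
import random

# Illustrative grid; the actual grids are in the paper's Tables 3 and 4.
grid = {
    "outer_lr": [1e-2, 1e-1, 1.0],
    "inner_lr": [1e-3, 1e-2],
    "inner_steps": [50, 100],
}

def random_search(grid, evaluate, n_trials, seed=0):
    """Sample n_trials distinct configs from the grid; return the one with the
    highest score from `evaluate` (here: worst-group validation accuracy)."""
    rng = random.Random(seed)
    all_configs = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
    trials = rng.sample(all_configs, min(n_trials, len(all_configs)))
    return max(trials, key=evaluate)

# Stand-in for "train with cfg, then measure worst-group validation accuracy".
def fake_evaluate(cfg):
    return 0.9 - abs(cfg["outer_lr"] - 0.1) - abs(cfg["inner_lr"] - 0.01)

best = random_search(grid, fake_evaluate, n_trials=12)
```

Selecting on worst-group (rather than average) validation accuracy matches the metric reported in Table 1, so the model chosen is the one most robust to the hardest subpopulation.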