Group-robust Sample Reweighting for Subpopulation Shifts via Influence Functions

Authors: Rui Qiao, Zhaoxuan Wu, Jingtan Wang, Pang Wei Koh, Bryan Kian Hsiang Low

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 EXPERIMENTS. Datasets. We evaluate the effectiveness of algorithms on 4 commonly used datasets. Waterbirds (Wah et al., 2011) is a binary object recognition dataset for bird types (i.e., waterbird, landbird), which are spuriously correlated with the background (i.e., water, land). CelebA (Liu et al., 2015) is a binary object recognition dataset for hair color blondness prediction. MultiNLI (Williams et al., 2017) is a multi-class natural language inference dataset. CivilComments (WILDS) (Borkan et al., 2019; Koh et al., 2021) is a binary text toxicity detection dataset. Table 1: Performance comparison. We report the worst-group accuracy on four benchmark datasets.
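The evidence above reports worst-group accuracy, the standard metric for subpopulation shift. As a minimal sketch (not the authors' code), it is simply the minimum per-group accuracy over all groups:

```python
import numpy as np

def worst_group_accuracy(preds, labels, groups):
    """Minimum per-group accuracy over all groups present in the data."""
    accs = []
    for g in np.unique(groups):
        mask = groups == g
        accs.append((preds[mask] == labels[mask]).mean())
    return min(accs)

# Toy example with 2 groups; group 1 is the worst-performing one.
preds  = np.array([1, 1, 0, 0, 1, 0])
labels = np.array([1, 1, 0, 1, 0, 0])
groups = np.array([0, 0, 0, 1, 1, 1])
wg_acc = worst_group_accuracy(preds, labels, groups)  # group 0: 3/3, group 1: 1/3
```

Unlike average accuracy, this metric is dominated by the hardest subpopulation (e.g., waterbirds on land backgrounds in Waterbirds), which is why it is the selection criterion throughout the paper.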
Researcher Affiliation | Academia | Rui Qiao1,2, Zhaoxuan Wu2, Jingtan Wang1,3, Pang Wei Koh4, Bryan Kian Hsiang Low1,2. 1 National University of Singapore; 2 Singapore-MIT Alliance for Research and Technology; 3 Agency for Science, Technology and Research (A*STAR); 4 University of Washington.
Pseudocode | Yes | Algorithm 1: Group-robust Sample Reweighting with last-layer retraining (GSR).
Open Source Code | Yes | Our code is available at https://github.com/qiaoruiyt/GSR.
Open Datasets | Yes | Datasets. We evaluate the effectiveness of algorithms on 4 commonly used datasets. Waterbirds (Wah et al., 2011) is a binary object recognition dataset... CelebA (Liu et al., 2015) is a binary object recognition dataset... MultiNLI (Williams et al., 2017) is a multi-class natural language inference dataset... CivilComments (WILDS) (Borkan et al., 2019; Koh et al., 2021) is a binary text toxicity detection dataset.
Dataset Splits | Yes | Overall, we follow the standard train, validation, and test splits for all datasets. We randomly create the target and validation sets by equally splitting the original validation set. We take out a random subset (e.g., 10%) of the training set as a held-out set D_tr-h.
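The split protocol quoted here can be sketched with stdlib tools; `split_indices` and the set sizes below are illustrative assumptions, not the authors' implementation:

```python
import random

def split_indices(n, frac, seed=0):
    """Randomly split range(n) into two disjoint index lists,
    the first holding ~frac of the points."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    k = int(frac * n)
    return idx[:k], idx[k:]

# Equally split the original validation set into target and validation halves.
target_idx, val_idx = split_indices(n=1000, frac=0.5, seed=0)

# Hold out a random 10% of the training set as D_tr-h.
heldout_idx, train_idx = split_indices(n=8000, frac=0.1, seed=1)
```

The key property is disjointness: the target set used to guide reweighting never overlaps the validation set used for model selection.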
Hardware Specification | Yes | For all experiments, we run on machines with an NVIDIA RTX 3080 (10GB) / RTX A5000 (24GB) GPU and an AMD EPYC 7543 CPU.
Software Dependencies | No | As we use PyTorch for our implementation and last-layer retraining has very few parameters, the per-sample gradient and Hessian inverse calculations involve high CPU usage, even though the tensors are on the GPU. For model architectures and initialization, we use an ImageNet-pretrained ResNet-50 (V1) (He et al., 2016) for the Waterbirds and CelebA datasets, and BERT (Hugging Face) for MultiNLI and CivilComments.
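The per-sample gradients and Hessian inverse mentioned here are the building blocks of influence functions (Koh & Liang, 2017) applied to a linear last layer. A minimal NumPy sketch for logistic regression (not the paper's PyTorch code; the explicit matrix inverse is tractable only because the last layer is small):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_sample_grads(w, X, y):
    """Per-sample gradients of the logistic loss w.r.t. weights w: shape (n, d)."""
    p = sigmoid(X @ w)
    return (p - y)[:, None] * X

def hessian(w, X, y, lam=1e-3):
    """Average Hessian of the L2-regularized logistic loss: (1/n) X^T D X + lam*I."""
    p = sigmoid(X @ w)
    d = p * (1.0 - p)
    return (X * d[:, None]).T @ X / len(y) + lam * np.eye(X.shape[1])

def influence_on_target(w, X_tr, y_tr, x_t, y_t):
    """Influence of each training point on a target point's loss: -g_t^T H^{-1} g_i."""
    H_inv = np.linalg.inv(hessian(w, X_tr, y_tr))
    g_t = per_sample_grads(w, x_t[None, :], np.array([y_t]))[0]
    G = per_sample_grads(w, X_tr, y_tr)
    return -G @ (H_inv @ g_t)

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(50, 4))
y_tr = (X_tr[:, 0] > 0).astype(float)
w = np.zeros(4)
scores = influence_on_target(w, X_tr, y_tr, X_tr[0], y_tr[0])
```

Materializing the (n, d) per-sample gradient matrix and inverting the (d, d) Hessian are exactly the dense linear-algebra steps that push work onto the CPU even when the tensors live on the GPU.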
Experiment Setup | Yes | For stage 1, we use almost the same hyperparameters as DFR (some are slightly altered for consistency and simplicity by heuristics, without any tuning); they are recorded in Table 2. For stage 2, there are two sets of hyperparameters, for the inner and outer loops. We perform a non-exhaustive search by randomly sampling hyperparameter combinations from a grid, and select the best-performing model checkpoints and configurations based on worst-group validation accuracy. The selected configurations are documented in Tables 3 and 4 for the outer and inner loops respectively.
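The non-exhaustive search described above can be sketched as uniform sampling from a grid with selection by worst-group validation accuracy; the grid values and `fake_evaluate` below are illustrative stand-ins, not the values from Tables 3 and 4:

```python
import itertools
import random

# Illustrative grid; the actual grids are in the paper's Tables 3 and 4.
grid = {
    "outer_lr": [1e-2, 1e-1, 1.0],
    "inner_lr": [1e-3, 1e-2],
    "inner_steps": [50, 100],
}

def random_search(grid, evaluate, n_trials, seed=0):
    """Sample n_trials distinct configs from the grid; return the one with the
    highest score from `evaluate` (here: worst-group validation accuracy)."""
    rng = random.Random(seed)
    all_configs = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
    trials = rng.sample(all_configs, min(n_trials, len(all_configs)))
    return max(trials, key=evaluate)

# Stand-in for "train with cfg, then measure worst-group validation accuracy".
def fake_evaluate(cfg):
    return 0.9 - abs(cfg["outer_lr"] - 0.1) - abs(cfg["inner_lr"] - 0.01)

best = random_search(grid, fake_evaluate, n_trials=12)
```

Selecting on worst-group (rather than average) validation accuracy matches the metric reported in Table 1, so the model chosen is the one most robust to the hardest subpopulation.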