Boosting Test Performance with Importance Sampling--a Subpopulation Perspective
Authors: Hongyu Shen, Zhizhen Zhao
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare different DBCM variants (see Sec. 3.2) against benchmark models on three benchmark datasets. We showcase the SOTA performance of our models to demonstrate the consistency of the theory developed in Sec. 3. In addition, we provide experimental evidence that complements the theory in explaining why existing works sacrifice average accuracy for higher worst-group accuracy. |
| Researcher Affiliation | Academia | Hongyu Shen, Zhizhen Zhao. Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Champaign, IL, 61820, U.S.A. |
| Pseudocode | Yes | Algorithm 1: The universal algorithm for optimizing q(y\|x, Itr). Input: the initialized model q(y\|x, Itr); dataset Dtr; the estimate p̂(s\|y, x, Itr). Output: the optimized q(y\|x, Itr). Step 1: Obtain ĝ(x, y, Itr, Ite) given p̂(s = y\|y, x, Itr) (see Eq. (5)). Step 2: Perform the following optimization using Dtr: max over q in Mtr of E_{(x,y)~p(x,y\|Itr)}[ĝ(x, y, Itr, Ite) log q(y\|x, Itr)] (Eq. (8)). |
| Open Source Code | Yes | Code https://github.com/skyve2012/DBA |
| Open Datasets | Yes | Specifically, we consider two vision datasets: Waterbirds (Sagawa et al. 2020) and Color MNIST (Nam et al. 2020; Tsirigotis et al. 2024), and one language dataset: Civil Comments (Borkan et al. 2019), in order to cover the two popular data types. |
| Dataset Splits | No | The paper does not explicitly provide specific training/test/validation dataset splits (e.g., exact percentages or sample counts). It refers to using models and datasets prepared by Yang et al. (2023) but does not detail the splits used within this paper. It mentions specific ratios for Color MNIST (0.5% and 2%) related to spurious samples, which is a characteristic of the dataset setup, not a general train/val/test split. |
| Hardware Specification | No | The paper mentions "computational resources supported by the National Science Foundation's Major Research Instrumentation program, grant #1725729, as well as the University of Illinois at Urbana-Champaign." However, this is a general acknowledgment of resources and does not specify any particular GPU models, CPU models, or other specific hardware used for the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions or other libraries). |
| Experiment Setup | Yes | In our experiments, we consider ratios 2% and 0.5%, as they are the intermediate and the hardest setups. In practice, we also treat p(m1\|Itr) and p(m0\|Itr) as prior knowledge/hyperparameters of the training composition. Specifically for Color MNIST, where the spurious-sample ratio is known, we directly assign 0.5% or 2% for p(m0\|Itr) (i.e., 1 − p(m1\|Itr)). When the composition ratio is unknown, p(m0\|Itr) is treated as a hyperparameter, and empirically we find that p(m0\|Itr) = 0.85 performs well across datasets. For model optimization, we consider the default optimizers and learning rates in Yang et al. (2023). Details are in the Appendix. |
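The optimization step quoted in the Pseudocode row (Eq. (8)) maximizes an importance-weighted log-likelihood, E[ĝ(x, y) log q(y|x)]. The sketch below is a minimal pure-Python illustration of that objective on a toy logistic model, not the paper's implementation (the released code is in the DBA repository); `weights` stands in for the paper's estimated ĝ, and the model family Mtr is reduced to a single scalar weight and bias.

```python
import math

def weighted_nll(w, b, xs, ys, weights):
    """Importance-weighted negative log-likelihood of q(y=1|x) = sigmoid(w*x + b).

    Minimizing this is equivalent to maximizing the Eq. (8)-style objective
    E[g_hat(x, y) * log q(y|x)], with `weights` playing the role of g_hat.
    """
    total = 0.0
    for x, y, g in zip(xs, ys, weights):
        p1 = 1.0 / (1.0 + math.exp(-(w * x + b)))
        p = p1 if y == 1 else 1.0 - p1
        total -= g * math.log(max(p, 1e-12))  # clamp to avoid log(0)
    return total / len(xs)

def train(xs, ys, weights, lr=0.5, steps=200):
    """Plain gradient descent on the weighted NLL for the (w, b) toy model."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, y, g in zip(xs, ys, weights):
            p1 = 1.0 / (1.0 + math.exp(-(w * x + b)))
            err = p1 - y          # gradient of the NLL w.r.t. the logit
            gw += g * err * x     # each sample's gradient is scaled by g_hat
            gb += g * err
        w -= lr * gw / len(xs)
        b -= lr * gb / len(xs)
    return w, b
```

With uniform weights this reduces to ordinary maximum likelihood; up-weighting a subpopulation's samples shifts the fit toward that group, which is the mechanism the paper uses to trade average accuracy against worst-group accuracy.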