Boosting Test Performance with Importance Sampling--a Subpopulation Perspective
Authors: Hongyu Shen, Zhizhen Zhao
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare different DBCM variants (see Sec. 3.2) against benchmark models on three benchmark datasets. We showcase the SOTA performance of our models to demonstrate the consistency of the theory developed in Sec. 3. In addition, we provide experimental evidence that complements the theory in explaining why existing works sacrifice average accuracy for higher worst-group accuracy. |
| Researcher Affiliation | Academia | Hongyu Shen, Zhizhen Zhao. Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Champaign, IL, 61820, U.S.A. |
| Pseudocode | Yes | Algorithm 1: The universal algorithm for optimizing q(y\|x, Itr). Input: the initialized model q(y\|x, Itr); dataset Dtr; the estimate p̂(s\|y, x, Itr). Output: the optimized q(y\|x, Itr). Step 1: Obtain ĝ(x, y, Itr, Ite) given p̂(s = y\|y, x, Itr) (see Eq. (5)). Step 2: Perform the following optimization using Dtr: max over q in Mtr of E_{(x,y)~p(x,y\|Itr)}[ĝ(x, y, Itr, Ite) log q(y\|x, Itr)] (Eq. (8)). |
| Open Source Code | Yes | Code https://github.com/skyve2012/DBA |
| Open Datasets | Yes | Specifically, we consider two vision datasets: Waterbirds (Sagawa et al. 2020) and Color MNIST (Nam et al. 2020; Tsirigotis et al. 2024), and one language dataset: Civil Comments (Borkan et al. 2019), in order to cover the two popular data types. |
| Dataset Splits | No | The paper does not explicitly provide specific training/test/validation dataset splits (e.g., exact percentages or sample counts). It refers to using models and datasets prepared by Yang et al. (2023) but does not detail the splits used within this paper. It mentions specific ratios for Color MNIST (0.5% and 2%) related to spurious samples, which is a characteristic of the dataset setup, not a general train/val/test split. |
| Hardware Specification | No | The paper mentions "computational resources supported by the National Science Foundation's Major Research Instrumentation program, grant #1725729, as well as the University of Illinois at Urbana-Champaign." However, this is a general acknowledgment of resources and does not specify any particular GPU models, CPU models, or other specific hardware used for the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions or other libraries). |
| Experiment Setup | Yes | In our experiments, we consider ratios 2% and 0.5%, as they are the intermediate and the hardest setups. In practice, we also treat p(m1\|Itr) and p(m0\|Itr) as prior knowledge/hyperparameters of the training composition. Specifically for Color MNIST, where the spurious-sample ratio is known, we directly assign 0.5% or 2% for p(m0\|Itr) (i.e., 1 − p(m1\|Itr)). When the composition ratio is unknown, p(m0\|Itr) is treated as a hyperparameter, and empirically we find that p(m0\|Itr) = 0.85 performs well across datasets. For model optimization, we consider the default optimizers and learning rates in Yang et al. (2023). Details are in the Appendix. |
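The optimization step quoted in the Pseudocode row (Eq. (8)) maximizes an importance-weighted log-likelihood, E[ĝ(x, y) log q(y|x)]. The sketch below is a minimal pure-Python illustration of that objective on a toy logistic model, not the paper's implementation (the released code is in the DBA repository); `weights` stands in for the paper's estimated ĝ, and the model family Mtr is reduced to a single scalar weight and bias.

```python
import math

def weighted_nll(w, b, xs, ys, weights):
    """Importance-weighted negative log-likelihood of q(y=1|x) = sigmoid(w*x + b).

    Minimizing this is equivalent to maximizing the Eq. (8)-style objective
    E[g_hat(x, y) * log q(y|x)], with `weights` playing the role of g_hat.
    """
    total = 0.0
    for x, y, g in zip(xs, ys, weights):
        p1 = 1.0 / (1.0 + math.exp(-(w * x + b)))
        p = p1 if y == 1 else 1.0 - p1
        total -= g * math.log(max(p, 1e-12))  # clamp to avoid log(0)
    return total / len(xs)

def train(xs, ys, weights, lr=0.5, steps=200):
    """Plain gradient descent on the weighted NLL for the (w, b) toy model."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, y, g in zip(xs, ys, weights):
            p1 = 1.0 / (1.0 + math.exp(-(w * x + b)))
            err = p1 - y          # gradient of the NLL w.r.t. the logit
            gw += g * err * x     # each sample's gradient is scaled by g_hat
            gb += g * err
        w -= lr * gw / len(xs)
        b -= lr * gb / len(xs)
    return w, b
```

With uniform weights this reduces to ordinary maximum likelihood; up-weighting a subpopulation's samples shifts the fit toward that group, which is the mechanism the paper uses to trade average accuracy against worst-group accuracy.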