Stabilizing black-box model selection with the inflated argmax

Authors: Melissa Adrian, Jake A Soloff, Rebecca Willett

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We illustrate this method in (a) a simulation in which strongly correlated covariates make standard LASSO model selection highly unstable, (b) a Lotka-Volterra model selection problem focused on identifying how competition in an ecosystem influences species abundances, (c) a graph subset selection problem using cell-signaling data from proteomics, and (d) unsupervised k-means clustering. In these settings, the proposed method yields stable, compact, and accurate collections of selected models, outperforming a variety of benchmarks."
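The paper's central idea, the inflated argmax, can be sketched in a simplified form (this is a hedged approximation for illustration; the paper's exact definition differs): rather than returning the single best-scoring model, return every model whose score is within a tolerance of the best, so near-ties no longer flip the selection.

```python
# Simplified sketch of an inflated argmax (illustrative approximation,
# not the paper's exact definition): instead of the single best-scoring
# model, return every model whose score is within eps of the maximum.

def inflated_argmax(scores, eps):
    best = max(scores)
    return {i for i, s in enumerate(scores) if s >= best - eps}

# Two near-tied models are both returned, so a small perturbation that
# swaps their order does not change the selected set.
print(sorted(inflated_argmax([0.90, 0.89, 0.40], eps=0.05)))  # [0, 1]
```

This stabilizes selection precisely in the near-tie regime where a standard argmax is most sensitive to resampling noise.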
Researcher Affiliation | Academia | Melissa Adrian (EMAIL), Data Science Institute, University of Chicago; Jake A. Soloff (EMAIL), Department of Statistics, University of Michigan; Rebecca Willett (EMAIL), Department of Statistics, University of Chicago; Department of Computer Science, University of Chicago; NSF-Simons National Institute for Theory and Mathematics in Biology
Pseudocode | Yes | Algorithm 1: Bagged model selection (Breiman, 1996a;b); Algorithm 2: Computing the number of clusters
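Bagged model selection in the spirit of Algorithm 1 can be sketched as follows (a hedged, Breiman-style sketch, assuming a black-box selector; `select_model` is a hypothetical stand-in, not the paper's implementation): draw bootstrap resamples, run the base selection rule on each, and tally how often each model is chosen.

```python
import random

# Hedged sketch of bagged (bootstrap-aggregated) model selection:
# run a black-box selection rule on B bootstrap resamples of the data
# and report the empirical selection frequency of each model.
# `select_model` is a hypothetical stand-in for any base selector.

def bagged_selection_frequencies(data, select_model, B=200, seed=0):
    rng = random.Random(seed)
    n = len(data)
    counts = {}
    for _ in range(B):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        chosen = select_model(resample)
        counts[chosen] = counts.get(chosen, 0) + 1
    return {model: c / B for model, c in counts.items()}

# Toy base selector: choose a label based on the sign of the resampled mean.
freqs = bagged_selection_frequencies(
    [0.3, -0.1, 0.2, 0.4],
    lambda d: "mean>0" if sum(d) / len(d) > 0 else "mean<=0",
)
```

The resulting frequencies quantify how stable the base selector is: a model chosen on nearly every resample is a robust pick, while frequencies spread across many models signal the instability the paper targets.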
Open Source Code | No | The paper mentions using the 'pysindy' Python package and the 'sklearn.tree.DecisionTreeClassifier' function, which are third-party tools. There is no explicit statement or link from the authors releasing their own source code for the methodology described in this paper.
Open Datasets | Yes | "We generate synthetic datasets... We provide further details of this data generation process in D.1." "We compute LOO stability results from a flow cytometry dataset in Sachs et al. (2005)." "Mouse embryonic stem cells were sequenced for their gene expression..." (Veleslavov and Stumpf, 2020).
Dataset Splits | Yes | "N = 100 trials (i.e., independent datasets) are independently generated according to the same data generation process." "We perform a grid search across two parameters to find a combination that leads to a low validation MSE. The 5-fold validation MSE is measured as..." "We compute LOO stability results from a flow cytometry dataset."
Hardware Specification | No | "In our experiments, we utilize a cluster computing system to distribute parallel jobs across CPU nodes." No specific CPU models, memory sizes, or other detailed hardware specifications are provided.
Software Dependencies | No | "We utilize the pysindy Python package (Kaptanoglu et al., 2022; de Silva et al., 2020) for their implementations of these methods..." "Our experiments use the default hyper-parameters in the sklearn.tree.DecisionTreeClassifier function." No version numbers are provided for these software components.
Experiment Setup | Yes | "We choose the hyperparameter combination that should give sparser models (larger λ and larger ω) in the case of tied validation MSE, which leads us to choose λ = 0.01 and ω = 0.18." "Based on this figure, the best choice of λ is λ = 77, which we keep constant throughout our experiments in Section 5.2.1." "In our generated example, the maximum number of clusters M = 29, and the slope tolerance ω = 5." "Our experiments use the default hyper-parameters in the sklearn.tree.DecisionTreeClassifier function."
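The slope-tolerance rule that the quoted setup (maximum clusters M, tolerance ω = 5) appears to use for choosing the number of clusters can be sketched as an elbow-style heuristic (an assumed reading of Algorithm 2, not the paper's exact procedure): scan the clustering-loss curve and stop at the smallest k where the marginal improvement falls below ω.

```python
# Hedged sketch of a slope-tolerance rule for choosing the number of
# clusters (assumed elbow-style reading of Algorithm 2, not the paper's
# exact procedure): losses[k-1] is the clustering loss with k clusters,
# for k = 1..M; stop at the smallest k where adding one more cluster
# improves the loss by less than omega.

def choose_num_clusters(losses, omega):
    for k in range(1, len(losses)):
        if losses[k - 1] - losses[k] < omega:
            return k  # improvement from k to k+1 clusters fell below omega
    return len(losses)

# Loss drops sharply up to 3 clusters, then flattens.
print(choose_num_clusters([100, 40, 12, 10, 9.5], omega=5))  # 3
```

With ω = 5 as in the quoted setup, the rule returns the point where the loss curve flattens rather than chasing diminishing improvements out to the maximum M.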