Stabilizing black-box model selection with the inflated argmax
Authors: Melissa Adrian, Jake A Soloff, Rebecca Willett
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate this method in (a) a simulation in which strongly correlated covariates make standard LASSO model selection highly unstable, (b) a Lotka-Volterra model selection problem focused on identifying how competition in an ecosystem influences species abundances, (c) a graph subset selection problem using cell-signaling data from proteomics, and (d) unsupervised k-means clustering. In these settings, the proposed method yields stable, compact, and accurate collections of selected models, outperforming a variety of benchmarks. |
| Researcher Affiliation | Academia | Melissa Adrian (EMAIL), Data Science Institute, University of Chicago; Jake A. Soloff (EMAIL), Department of Statistics, University of Michigan; Rebecca Willett (EMAIL), Department of Statistics, University of Chicago; Department of Computer Science, University of Chicago; NSF-Simons National Institute for Theory and Mathematics in Biology |
| Pseudocode | Yes | Algorithm 1: Bagged model selection (Breiman, 1996a;b); Algorithm 2: Computing the number of clusters |
| Open Source Code | No | The paper mentions using the 'pysindy' Python package and the 'sklearn.tree.DecisionTreeClassifier' function, which are third-party tools. There is no explicit statement or link provided by the authors for the release of their own source code for the methodology described in this paper. |
| Open Datasets | Yes | We generate synthetic datasets... We provide further details of this data generation process in D.1. ... We compute LOO stability results from a flow cytometry dataset in Sachs et al. (2005). ... Mouse embryonic stem cells were sequenced for their gene expression... (Veleslavov and Stumpf, 2020). |
| Dataset Splits | Yes | N = 100 trials (i.e., independent datasets) are independently generated according to the same data generation process. ... We perform a grid search across two parameters to find a combination that leads to a low validation MSE. The 5-fold validation MSE is measured as... We compute LOO stability results from a flow cytometry dataset. |
| Hardware Specification | No | In our experiments, we utilize a cluster computing system to distribute parallel jobs across CPU nodes. No specific CPU models, memory, or other detailed hardware specifications are provided. |
| Software Dependencies | No | We utilize the pysindy Python package (Kaptanoglu et al., 2022; de Silva et al., 2020) for their implementations of these methods... Our experiments use the default hyperparameters in the sklearn.tree.DecisionTreeClassifier function. No specific version numbers are provided for these software components. |
| Experiment Setup | Yes | We choose the hyperparameter combination that should give sparser models (larger λ and larger ω) in the case of tied validation MSE, which leads us to choose λ = 0.01 and ω = 0.18. Based on this figure, the best choice of λ is λ = 77, which we keep constant throughout our experiments in Section 5.2.1. In our generated example, the maximum number of clusters M = 29, and the slope tolerance ω = 5. Our experiments use the default hyperparameters in the sklearn.tree.DecisionTreeClassifier function. |
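
The paper's core recipe (Algorithm 1, "Bagged model selection") combines Breiman-style bagging with a selection rule over the bagged votes. As a rough illustration of that idea, not the authors' implementation, the sketch below fits LASSO on bootstrap resamples, tallies how often each support set is selected, and returns every support whose selection frequency is within `eps` of the maximum. The `eps` threshold is a simplified stand-in for the paper's inflated argmax, and all function and parameter names here are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

def bagged_lasso_selection(X, y, n_boot=200, alpha=0.1, eps=0.05, seed=None):
    """Sketch of bagged model selection: fit LASSO on bootstrap
    resamples, count how often each support set is chosen, and keep
    every support within `eps` of the top selection frequency
    (a crude proxy for the inflated argmax)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    counts = {}
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)              # bootstrap resample
        model = Lasso(alpha=alpha).fit(X[idx], y[idx])
        support = tuple(np.flatnonzero(np.abs(model.coef_) > 1e-8))
        counts[support] = counts.get(support, 0) + 1
    freqs = {s: c / n_boot for s, c in counts.items()}
    best = max(freqs.values())
    return {s for s, f in freqs.items() if f >= best - eps}
```

On strongly correlated covariates, as in the paper's simulation (a), a single LASSO fit flips between near-equivalent supports across datasets; the bagged vote counts make that instability visible, and returning a *set* of near-top supports is what stabilizes the selection.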
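
The experiment setup row describes a grid search that breaks ties in validation MSE in favor of sparser models (larger λ and larger ω). A minimal sketch of that tiebreak rule, assuming a lexicographic preference for larger (λ, ω) among (near-)tied pairs and a hypothetical `tol` for what counts as a tie:

```python
def pick_sparser_on_ties(results, tol=1e-9):
    """Given {(lam, omega): validation_mse}, return the pair with the
    lowest MSE, breaking near-ties (within `tol`) in favor of larger
    lam and omega, i.e. the combination expected to give sparser
    models, as described in the experiment setup."""
    best_mse = min(results.values())
    tied = [k for k, v in results.items() if v <= best_mse + tol]
    return max(tied)  # lexicographic max: largest lam, then largest omega
```

For example, if (λ, ω) = (0.001, 0.10) and (0.01, 0.18) achieve the same 5-fold validation MSE, this rule selects (0.01, 0.18), matching the paper's reported choice of λ = 0.01 and ω = 0.18.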