Scalable First-order Method for Certifying Optimal k-Sparse GLMs

Authors: Jiachang Liu, Soroosh Shafiee, Andrea Lodi

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on synthetic and real-world datasets show that our approach significantly accelerates dual bound computations and is highly effective in providing optimality certificates for large-scale problems. We evaluate our proposed methods using both synthetic and real-world datasets to address three key empirical questions: (1) How fast is our customized PAVA algorithm in evaluating prox_g compared to existing solvers? (2) How fast is our proposed FISTA method in calculating the lower bounds compared to existing solvers? (3) How fast is our customized BnB algorithm compared to existing solvers?
Researcher Affiliation Academia 1School of Operations Research and Information Engineering, Cornell University, Ithaca, NY, USA. 2Jacobs Technion-Cornell Institute, Cornell Tech and Technion-IIT, New York, NY, USA. Correspondence to: Jiachang Liu <EMAIL>, Soroosh Shafiee <EMAIL>, Andrea Lodi <EMAIL>.
Pseudocode Yes Algorithm 1: Customized PAVA to solve prox_{ρg}(µ) ... Algorithm 2: Algorithm to compute g(β) ... Algorithm 3: Main algorithm to solve problem (5) ... Algorithm 4: Up-and-Down Block Algorithm for Merging in PAVA
Open Source Code Yes Implementations discussed in this paper are available at https://github.com/jiachangliu/OKGLM.
Open Datasets Yes For the real-world datasets, we use the dataset cancer drug response (Liu et al., 2020) for linear regression and DOROTHEA (Asuncion & Newman, 2007) for logistic regression.
Dataset Splits Yes For the results in the main paper, we set the n-to-p ratio to be 1.0, the box constraint M to be 2, the number of nonzero coefficients k (also the cardinality constraint) to be 10, and the ℓ2 regularization coefficient λ2 to be 1.0. ... The k values for both real-world datasets are selected by first performing 5-fold cross-validation with a heuristic sparse learning algorithm.
Hardware Specification Yes When investigating how much GPUs can accelerate our computation, we ran the experiments with both CPU and GPU implementations on Nvidia A100s. For everything else, we ran the experiments with the CPU implementation on AMD Milan with CPU speed 2.45 GHz and 8 cores.
Software Dependencies Yes We implement our algorithms in Python. For baselines, we compare with the following state-of-the-art commercial and open-source SOCP solvers: Gurobi (Gurobi Optimization, LLC, 2025), MOSEK (MOSEK ApS, 2025), SCS (O'Donoghue et al., 2023), and Clarabel (Goulart & Chen, 2024), with the Python package cvxpy (Diamond & Boyd, 2016) as the interface to these solvers.
Experiment Setup Yes Input: number of iterations T, coefficient λ2 for the ℓ2 regularization, and step size L (Lipschitz-continuity parameter of ∇F(β)) ... For the cardinality constraint k, we set k = 10 for both synthetic datasets... For the cancer drug response dataset, we set k = 5. For DOROTHEA, we set k = 10. ... For the ℓ2 regularization coefficient, we set λ2 = 1. For the box constraint, we set M = 2 for the synthetic datasets and DOROTHEA. ... For the cancer drug response dataset, we set M = 5.
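The rows above reference a customized PAVA routine for evaluating the proximal operator prox_{ρg}. The paper's customized variant is not reproduced here; as a point of comparison, a textbook Pool-Adjacent-Violators routine for weighted isotonic regression (the classical algorithm the paper customizes) can be sketched as:

```python
import numpy as np

def pava(y, w=None):
    """Pool Adjacent Violators: project y onto the set of non-decreasing
    sequences under weighted least squares. Returns the isotonic fit."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    # Each block stores [weighted mean, total weight, block length].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge adjacent blocks while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / wt, wt, n1 + n2])
    # Expand each pooled block back to its original positions.
    return np.concatenate([np.full(n, m) for m, _, n in blocks])
```

For example, `pava([3, 1, 2])` pools the violating pair (3, 1) into mean 2, yielding the non-decreasing fit `[2, 2, 2]`. The paper's Algorithm 4 (Up-and-Down Block merging) refines the merging step of this basic scheme.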