Stacey: Promoting Stochastic Steepest Descent via Accelerated $\ell_p$-Smooth Nonconvex Optimization

Authors: Xinyu Luo, Site Bai, Bolian Li, Petros Drineas, Ruqi Zhang, Brian Bullins

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present empirical evidence that the STACEY optimizer outperforms other optimizers in both convergence speed and accuracy. We evaluate STACEY's effectiveness on image classification (Section 5.1) and LLM pretraining (Section 5.2). The hyperparameter choices and tuning are summarized in Appendix C. Table 1. Image classification on CIFAR at the 50th, 100th, and 200th epochs. STACEY consistently outperforms other optimizers, demonstrating both superior accuracy and faster convergence. Table 2. Image classification on ImageNet at the 20th, 40th, and 60th epochs. STACEY demonstrates superior test accuracy and faster convergence compared to other optimizers. Figure 1. Learning curves of CIFAR classification with varying ℓp-norm.
Researcher Affiliation | Academia | Department of Computer Science, Purdue University, Indiana, USA. Correspondence to: Xinyu Luo <EMAIL>, Cedar Site Bai <EMAIL>, Bolian Li <EMAIL>.
Pseudocode | Yes | Algorithm 1: STACEY(p,2) Optimizer; Algorithm 2: STACEY(p,p) Optimizer; Algorithm 3: Stochastic ℓp Descent.
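The paper's Algorithm 3 (Stochastic ℓp Descent) is not reproduced in this report, but the classical ℓp steepest-descent update it builds on has a well-known closed form: the direction maximizing alignment with the gradient under a unit ℓp-norm constraint scales each coordinate as sign(g_i)·|g_i|^(1/(p−1)). A minimal sketch under that assumption (the function name `lp_steepest_descent_step` is hypothetical, not from the paper):

```python
import numpy as np

def lp_steepest_descent_step(x, grad, lr=0.1, p=3.0):
    """One generic l_p steepest-descent step (a sketch, not the
    paper's Algorithm 3). The update direction scales each coordinate
    as sign(g_i) * |g_i|^(1/(p-1)); p=2 recovers normalized gradient
    descent, and p -> infinity approaches sign descent.
    """
    q = 1.0 / (p - 1.0)
    d = np.sign(grad) * np.abs(grad) ** q
    # Normalize the direction to unit l_p norm (guard against a zero gradient).
    norm = np.sum(np.abs(d) ** p) ** (1.0 / p)
    if norm > 0:
        d = d / norm
    return x - lr * d
```

Setting p=2 makes the step a plain normalized gradient step, which is a quick sanity check on the formula.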
Open Source Code | Yes | Code can be found at https://github.com/xinyuluo8561/Stacey.
Open Datasets | Yes | We train ResNet18 (He et al., 2016) on the CIFAR dataset (Krizhevsky, 2009) for 200 epochs... We train ResNet50 (He et al., 2016) with a batch size of 256 on ImageNet (Deng et al., 2009)... We pretrain llama-100m (Touvron et al., 2023) on the C4 subset.
Dataset Splits | Yes | We train ResNet18 (He et al., 2016) on the CIFAR dataset (Krizhevsky, 2009) for 200 epochs... We train ResNet50 (He et al., 2016) with a batch size of 256 on ImageNet (Deng et al., 2009) for 60 epochs.
Hardware Specification | No | Due to computational resource limitations, the batch sizes used in this paper are smaller than those in Lion's original paper (Chen et al., 2024).
Software Dependencies | No | The paper mentions optimizers like SGD, Adam, AdamW, and Lion and refers to...
Experiment Setup | Yes | The hyperparameter choices and tuning are summarized in Appendix C. We summarize the hyperparameters used in our experiments in Tables 4, 5, and 6. These hyperparameters are determined through a grid search. Specifically, we perform a search to identify appropriate values for the ℓp-norm, learning rate η, α, and weight decay λ. This process involves an initial rough comparison across a range of magnitudes, followed by a more precise grid search to determine the optimal values.
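The coarse-then-fine tuning procedure the quoted setup describes (an order-of-magnitude sweep first, then a finer grid around the winner) can be sketched generically as follows; the helper names (`two_stage_search`, `refine`) and the example grids are illustrative, not the paper's actual search space from Appendix C:

```python
import itertools

def two_stage_search(evaluate, coarse_grids, refine):
    """Coarse-to-fine hyperparameter search (a generic sketch).

    Stage 1 sweeps rough order-of-magnitude values; stage 2 re-runs the
    grid search on a finer grid around the stage-1 winner, which the
    caller-supplied `refine` function constructs.
    """
    def best(grids):
        # Enumerate all combinations of the grid values and keep the
        # configuration with the lowest evaluation score (e.g. val loss).
        combos = [dict(zip(grids, values))
                  for values in itertools.product(*grids.values())]
        return min(combos, key=evaluate)

    rough_best = best(coarse_grids)
    return best(refine(rough_best))

# Illustrative usage with a toy objective (not real training results):
score = lambda cfg: (cfg["lr"] - 0.01) ** 2 + (cfg["wd"] - 1e-4) ** 2
coarse = {"lr": [1e-4, 1e-3, 1e-2, 1e-1], "wd": [1e-5, 1e-4, 1e-3]}
fine = lambda b: {"lr": [b["lr"] / 2, b["lr"], b["lr"] * 2], "wd": [b["wd"]]}
best_cfg = two_stage_search(score, coarse, fine)
```

The two-stage structure keeps the total number of training runs small compared to a single dense grid over the full range.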