Stacey: Promoting Stochastic Steepest Descent via Accelerated $\ell_p$-Smooth Nonconvex Optimization
Authors: Xinyu Luo, Site Bai, Bolian Li, Petros Drineas, Ruqi Zhang, Brian Bullins
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present empirical evidence that the STACEY optimizer outperforms other optimizers in both convergence speed and accuracy. We evaluate STACEY's effectiveness on image classification (Section 5.1) and LLM pretraining (Section 5.2). The hyperparameter choices and tuning are summarized in Appendix C. Table 1. Image classification on CIFAR at the 50th, 100th, and 200th epochs. STACEY consistently outperforms other optimizers, demonstrating both superior accuracy and faster convergence. Table 2. Image classification on ImageNet at the 20th, 40th, and 60th epochs. STACEY demonstrates superior test accuracy and faster convergence compared to other optimizers. Figure 1. Learning curves of CIFAR classification with varying ℓp-norm. |
| Researcher Affiliation | Academia | Department of Computer Science, Purdue University, Indiana, USA. Correspondence to: Xinyu Luo <EMAIL>, Cedar Site Bai <EMAIL>, Bolian Li <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 STACEY(p,2) Optimizer Algorithm 2 STACEY(p,p) Optimizer Algorithm 3 Stochastic ℓp Descent |
| Open Source Code | Yes | Code can be found at https://github.com/xinyuluo8561/Stacey. |
| Open Datasets | Yes | We train ResNet18 (He et al., 2016) on the CIFAR dataset (Krizhevsky, 2009) for 200 epochs... We train ResNet50 (He et al., 2016) with a batch size 256 on ImageNet (Deng et al., 2009)... We pretrain llama-100m (Touvron et al., 2023) on the C4 subset. |
| Dataset Splits | Yes | We train ResNet18 (He et al., 2016) on the CIFAR dataset (Krizhevsky, 2009) for 200 epochs... We train ResNet50 (He et al., 2016) with a batch size 256 on ImageNet (Deng et al., 2009) for 60 epochs. |
| Hardware Specification | No | Due to computational resource limitations, the batch sizes used in this paper are smaller than those in Lion's original paper (Chen et al., 2024). |
| Software Dependencies | No | The paper mentions optimizers like SGD, Adam, AdamW, and Lion, but does not specify the software libraries, frameworks, or versions used. |
| Experiment Setup | Yes | The hyperparameter choices and tuning are summarized in Appendix C. We summarize the hyperparameters used in our experiments in Tables 4, 5, and 6. These hyperparameters are determined through a grid search. Specifically, we perform a search to identify appropriate values for the ℓp-norm, learning rate η, α, and weight decay λ. This process involves an initial rough comparison across a range of magnitudes, followed by a more precise grid search to determine the optimal values. |
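The Pseudocode row lists a "Stochastic ℓp Descent" routine (Algorithm 3), whose body is not reproduced in this summary. As a rough point of reference, the textbook steepest-descent direction with respect to the ℓp norm can be sketched as below; this is a standard construction, not necessarily the paper's exact Algorithm 3, and the function name, step size, and test objective are illustrative assumptions.

```python
import numpy as np

def lp_steepest_descent_step(x, grad, lr=0.1, p=3.0):
    """One steepest-descent step w.r.t. the l_p norm.

    Textbook form (a sketch, not necessarily the paper's Algorithm 3):
    the direction minimizing <grad, d> subject to ||d||_p <= 1 has
    components proportional to sign(g_i) * |g_i|^(q-1), where
    q = p / (p-1) is the dual exponent; p = 2 recovers normalized
    gradient descent.
    """
    q = p / (p - 1.0)                              # dual exponent: 1/p + 1/q = 1
    scaled = np.sign(grad) * np.abs(grad) ** (q - 1.0)
    norm = np.linalg.norm(scaled, ord=p)           # makes ||direction||_p = 1
    if norm == 0.0:
        return x                                   # zero gradient: stay put
    return x - lr * scaled / norm

# Toy usage: minimize f(x) = ||x||^2 / 2, whose gradient is x itself.
x = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    x = lp_steepest_descent_step(x, x, lr=0.05, p=3.0)
```

Because the step is normalized, the iterate ends in a small neighborhood of the minimizer (radius on the order of the learning rate) rather than converging exactly; in practice a decaying step size would be used.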