Structured Preconditioners in Adaptive Optimization: A Unified Analysis
Authors: Shuo Xie, Tianhao Wang, Sashank J. Reddi, Sanjiv Kumar, Zhiyuan Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our unified analysis challenges this prevailing view and reveals, perhaps surprisingly, that more structured preconditioners, despite using less space and computation per step, can outperform their less structured counterparts. To demonstrate this, we show that one-sided Shampoo, which is relatively much cheaper than full-matrix AdaGrad, could outperform it both theoretically and experimentally. ... In this section we empirically demonstrate the superior performance of one-sided Shampoo over other variants of AdaReg (Algorithm 1) on a simple but natural setting. Moreover, such superior performance is predicted by our theoretical analysis in Section 4.3, which in turn validates the practical utility of our theory in guiding optimizer selection. |
| Researcher Affiliation | Collaboration | ¹Toyota Technological Institute at Chicago, ²Google Research. Correspondence to: Shuo Xie <EMAIL>, Zhiyuan Li <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Adaptive Regularization Meta-Algorithm AdaReg (Gupta et al., 2017) ... Algorithm 2 One-sided Shampoo ... Algorithm 3 Two-sided Shampoo (Gupta et al., 2018) ... Algorithm 4 EMA Adaptive Regularization Meta-Algorithm ... Algorithm 5 Weighted Adaptive Regularization Meta-Algorithm |
| Open Source Code | No | The paper does not provide any specific link to a code repository, an explicit statement about code release, or mention of code in supplementary materials for the methodology described. |
| Open Datasets | No | We consider a linear regression problem ∥AX − y∥₂² where A is the data matrix and y = AX* is the label vector generated by ground-truth X*. Thus, the loss function can be equivalently written as f(X) = ⟨H, (X − X*)(X − X*)ᵀ⟩. ... Each element of the solution X* is independently sampled from N(0, 1/d). |
| Dataset Splits | No | The paper describes a synthetic linear regression problem where data and ground truth are generated. It does not mention any training, validation, or test splits for an external dataset, as the problem setup is self-contained. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper discusses various adaptive optimizers like AdaGrad-Norm, AdaGrad, Shampoo, and Adam, but it does not specify the version numbers of any software libraries, frameworks, or programming languages used for their implementation or experiments. |
| Experiment Setup | Yes | We consider X ∈ ℝ^(d×d) with d = 10³. We set the eigenvalues of H by σ₁ = ⋯ = σ₁₀ = 1 and σᵢ = 1/(i − 10)² for 11 ≤ i ≤ 10³. Each element of the solution X* is independently sampled from N(0, 1/d). We run AdaGrad-Norm, AdaGrad, one-sided Shampoo, and full-matrix AdaGrad for 100 steps from initialization X₀ = 0. ... The learning rate is tuned over five seeds for last-iterate loss and average-iterate loss respectively, selecting the one with the lowest average loss. We use float32 precision and set ϵ = 0 for all the experiments. |
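The experiment-setup row is concrete enough to sketch in code. Below is a minimal reconstruction, not the authors' implementation: H is taken to be diagonal (the paper specifies only its eigenvalues), the learning rate `eta = 1.0` is an illustrative choice rather than the paper's tuned value, and only the AdaGrad-Norm baseline (a single scalar preconditioner, the most structured extreme) is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000  # d = 10^3, as in the paper's setup

# Eigenvalues of H: sigma_1 = ... = sigma_10 = 1, sigma_i = 1/(i-10)^2 for 11 <= i <= d.
sigma = np.ones(d)
i = np.arange(11, d + 1)
sigma[10:] = 1.0 / (i - 10) ** 2

X_star = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, d))  # ground truth, entries ~ N(0, 1/d)
X = np.zeros((d, d))                                     # initialization X_0 = 0

def loss(X):
    # f(X) = <H, (X - X*)(X - X*)^T>; with diagonal H this is a row-weighted squared error.
    D = X - X_star
    return float(np.sum(sigma[:, None] * D * D))

def grad(X):
    return 2.0 * sigma[:, None] * (X - X_star)

# AdaGrad-Norm: divide by the root of the accumulated squared gradient norm.
eta, G2 = 1.0, 0.0
losses = [loss(X)]
for _ in range(100):
    g = grad(X)
    G2 += float(np.sum(g * g))
    X = X - eta * g / np.sqrt(G2)  # epsilon = 0, matching the paper's setting
    losses.append(loss(X))
```

Reproducing the paper's comparison would additionally require diagonal AdaGrad, one-sided Shampoo, and full-matrix AdaGrad on the same problem, each with the learning rate tuned over five seeds as the row describes.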