Dimension-Free Adaptive Subgradient Methods with Frequent Directions

Authors: Sifan Yang, Yuanyu Wan, Peijia Li, Yibo Wang, Xiao Zhang, Zhewei Wei, Lijun Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results have verified the efficiency and effectiveness of our approaches. Finally, we conduct experiments on online classification and neural network training to validate the superiority of our methods.
Researcher Affiliation | Academia | 1National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China 2School of Artificial Intelligence, Nanjing University, Nanjing, China 3School of Software Technology, Zhejiang University, Ningbo, China 4Hangzhou High Tech Zone (Binjiang) Institute of Blockchain and Data Security, Hangzhou, China 5Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China 6Pazhou Laboratory (Huangpu), Guangzhou, China. Correspondence to: Lijun Zhang <EMAIL>.
Pseudocode | Yes | Algorithm 1: Frequent Directions (FD); Algorithm 2: Follow the Sketchy Leader (FTSL); Algorithm 3: Follow the Fast Sketchy Leader (FTFSL); Algorithm 4: Frequent Directions in General Form; Algorithm 5: FTSL-Shampoo; Algorithm 6: Online to Batch Conversion
Open Source Code | No | The paper does not provide any explicit statements about releasing source code, nor does it include links to code repositories.
Open Datasets | Yes | First, we perform online classification to evaluate the performance of our methods with two real-world datasets from the LIBSVM (Chang & Lin, 2011) repository: Gisette and Epsilon... The experiments involve training ResNet18 and ResNet34 models (He et al., 2016) on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009)... Concretely, we train a 2-layer Transformer (Vaswani et al., 2017) over the WikiText-2 dataset (Merity, 2016).
Dataset Splits | Yes | For the Gisette dataset, we set the batch size n = 32, the sketching size τ = 50 to be 1% of the original dimensionality, and T = 2000... The Epsilon dataset consists of 400,000 training samples and 100,000 testing samples... The experiments involve training ResNet18 and ResNet34 models (He et al., 2016) on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009), respectively, for 200 iterations with a batch size of 128... The batch size is set as 64 and all methods are trained for 40 epochs with dropout rate 0.1.
Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA 3090 GPUs.
Software Dependencies | No | The paper mentions using LIBSVM but does not provide specific version numbers for any software, libraries, or frameworks used in the implementation.
Experiment Setup | Yes | For the Gisette dataset, we set the batch size n = 32, the sketching size τ = 50 to be 1% of the original dimensionality, and T = 2000. For the Epsilon dataset, we set the batch size n = 128, τ = 20, and T = 5000... For ADA-FFD, S-ADA, and FTFSL, the sketching size τ is determined based on the dimensionality of the flattened gradient, which is defined as: τ = min{0.1d, 100}... For S-Shampoo and FTSL-Shampoo, due to their memory efficiency, we set τ = 0.1 d_i... We use 256-dimensional word embeddings, 256 hidden units, and 2 heads. We also clip the gradients by norm 0.5 to guard against exploding gradients. The batch size is set as 64 and all methods are trained for 40 epochs with dropout rate 0.1.
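Algorithm 1 in the paper is the standard Frequent Directions sketch that the proposed methods build on. As a reference point only (this is a minimal NumPy sketch of the classic doubled-buffer FD routine of Liberty (2013) and Ghashami et al. (2016), not the authors' implementation), it maintains an ℓ-row summary B of a row stream such that B^T B approximates A^T A:

```python
import numpy as np

def frequent_directions(A, ell):
    """Sketch the rows of A (n x d) into B (ell x d) so that
    ||A.T @ A - B.T @ B||_2 <= ||A||_F**2 / ell."""
    n, d = A.shape
    B = np.zeros((2 * ell, d))  # doubled buffer amortizes the SVD cost
    nrows = 0
    for row in A:
        if nrows == 2 * ell:    # buffer full: shrink back to ell rows
            B = _shrink(B, ell)
            nrows = ell
        B[nrows] = row
        nrows += 1
    return _shrink(B, ell)[:ell]  # final shrink, keep the top ell rows

def _shrink(B, ell):
    # SVD the buffer and subtract the ell-th largest squared singular
    # value from all of them, which zeroes out at least half of the rows.
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    delta = s[min(ell, len(s)) - 1] ** 2
    s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
    B_new = np.zeros_like(B)
    B_new[: len(s)] = s_shrunk[:, None] * Vt
    return B_new
```

The sketching size ℓ here plays the role of τ in the experiment setup above: the adaptive methods in the paper feed (sub)gradients into such a sketch to approximate the full d x d preconditioner at O(τd) memory.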