A Coefficient Makes SVRG Effective
Authors: Yida Yin, Zhiqiu Xu, Zhiyuan Li, Trevor Darrell, Zhuang Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical analysis finds that, for deeper neural networks, the strength of the variance reduction term in SVRG should be smaller and decrease as training progresses. Inspired by this, we introduce a multiplicative coefficient α to control the strength and adjust it through a linear decay schedule. We name our method α-SVRG. We evaluate α-SVRG on a range of model architectures and multiple image classification datasets; it consistently achieves a lower training loss than both the baseline and standard SVRG. Our results highlight the value of SVRG in deep learning. 5 EXPERIMENTS Table 2 presents the results of training various models on ImageNet-1K. Table 3 displays the results of training ConvNeXt-F on various smaller datasets. |
| Researcher Affiliation | Collaboration | 1UC Berkeley 2University of Pennsylvania 3TTIC 4Meta AI Research |
| Pseudocode | Yes | The pseudocode for α-SVRG with SGD and AdamW as base optimizers is provided in Appendix G. Appendix G PSEUDOCODE FOR α-SVRG: Algorithm 1 α-SVRG with SGD, Algorithm 2 α-SVRG with AdamW |
| Open Source Code | Yes | Code is available at github.com/davidyyd/alpha-SVRG. |
| Open Datasets | Yes | We evaluate α-SVRG using ImageNet-1K classification (Deng et al., 2009) as well as smaller image classification datasets: CIFAR-100 (Krizhevsky, 2009), Pets (Parkhi et al., 2012), Flowers (Nilsback & Zisserman, 2008), STL-10 (Coates et al., 2011), Food-101 (Bossard et al., 2014), DTD (Cimpoi et al., 2014), SVHN (Netzer et al., 2011), and EuroSAT (Helber et al., 2019). |
| Dataset Splits | Yes | We evaluate α-SVRG using ImageNet-1K classification (Deng et al., 2009) as well as smaller image classification datasets: CIFAR-100 (Krizhevsky, 2009), Pets (Parkhi et al., 2012), Flowers (Nilsback & Zisserman, 2008), STL-10 (Coates et al., 2011), Food-101 (Bossard et al., 2014), DTD (Cimpoi et al., 2014), SVHN (Netzer et al., 2011), and EuroSAT (Helber et al., 2019). We report both final epoch training loss and top-1 validation accuracy. These are widely recognized benchmark datasets with established standard splits for training and validation. |
| Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, memory specifications) used for running the experiments are provided in the paper. |
| Software Dependencies | No | The paper mentions using AdamW and SGD as optimizers and refers to PyTorch image models, but it does not specify software versions (e.g., Python version, PyTorch version, CUDA version) needed for replication. |
| Experiment Setup | Yes | Our basic training recipe, adapted from ConvNeXt (Liu et al., 2022): weight init trunc. normal (0.2); optimizer AdamW; base learning rate 4e-3; weight decay 0.05; optimizer momentum β1, β2 = 0.9, 0.999; learning rate schedule cosine decay; warmup schedule linear; randaugment (Cubuk et al., 2020) (9, 0.5); mixup (Zhang et al., 2018) 0.8; cutmix (Yun et al., 2019) 1.0; random erasing (Zhong et al., 2020) 0.25; label smoothing (Szegedy et al., 2016) 0.1. Table 6 lists the batch size, warmup epochs, and training epochs for each dataset. For larger models, we adhere to the original work (Dosovitskiy et al., 2021; Liu et al., 2022), using a stochastic depth rate of 0.4 for ViT-B and 0.5 for ConvNeXt-B. On small datasets, we choose the best α0 from {0.5, 0.75, 1}. For ImageNet-1K, we set α0 to 0.75 for smaller models and 0.5 for larger ones. |
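The core idea summarized above (an SVRG variance-reduction term scaled by a coefficient α that decays linearly from α0 toward 0) can be sketched with SGD on a toy least-squares problem. This is an illustrative sketch, not the paper's implementation: the problem setup (`A`, `b`), helper names (`grad_i`, `full_grad`, `alpha_svrg_sgd`), and hyperparameters are all assumptions chosen for a self-contained example.

```python
import numpy as np

# Hypothetical toy problem: per-sample least-squares losses 0.5*(A[i]@w - b[i])**2.
rng = np.random.default_rng(0)
n, d = 64, 10
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad_i(w, i):
    """Gradient of the i-th per-sample loss."""
    return (A[i] @ w - b[i]) * A[i]

def full_grad(w):
    """Full-batch gradient, used at each snapshot."""
    return (A.T @ (A @ w - b)) / n

def alpha_svrg_sgd(w, alpha0=0.75, lr=0.01, epochs=30):
    """alpha-SVRG with SGD as the base optimizer (sketch).

    alpha = 1 recovers standard SVRG; alpha = 0 recovers plain SGD.
    Here alpha decays linearly from alpha0 to ~0 over training,
    mirroring the linear decay schedule described in the paper.
    """
    total_steps = epochs * n
    step = 0
    for _ in range(epochs):
        w_snap = w.copy()        # snapshot weights, refreshed once per epoch
        mu = full_grad(w_snap)   # full gradient at the snapshot
        for i in rng.permutation(n):
            alpha = alpha0 * (1 - step / total_steps)  # linear decay
            # alpha-scaled variance-reduced gradient estimate
            g = grad_i(w, i) - alpha * (grad_i(w_snap, i) - mu)
            w -= lr * g
            step += 1
    return w

w = alpha_svrg_sgd(np.zeros(d))
```

The snapshot/full-gradient structure follows standard SVRG; the only change is the multiplicative coefficient on the control-variate term, which is the paper's contribution.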