GGD: Grafting Gradient Descent

Authors: Yanjing Feng, Yongdao Zhou

JMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental The real data studies also show that GGD achieves intermediate performance between SGD with importance sampling and mini-batch SGD, and outperforms the original SGD method. Thus the proposed GGD is a better and more robust stochastic optimization framework in practice. [...] Our empirical results are presented in this section. We evaluate the performance of grafting-gradient-based algorithms on solving strongly convex and non-convex problems, and compare their performance with vanilla SGD, SGD with importance sampling, mini-batch SGD, the variance reduction method SVRG, and the adaptive stepsize method Adam. We first run experiments on the L2-regularized logistic regression problem [...] We then run experiments to solve multiclass classification problems by training convolutional neural networks [...] Experiment results are presented in Figures 10 and 11.
Researcher Affiliation Academia Yanjing Feng (EMAIL), NITFID, School of Statistics and Data Science, Nankai University, Tianjin 300071, China; Yongdao Zhou (EMAIL), NITFID, School of Statistics and Data Science, Nankai University, Tianjin 300071, China
Pseudocode Yes Algorithm 1: Grafting Gradient Descent [...] Algorithm 2: GGD-WR-SVRG method [...] Algorithm 3: GGD-WR-Adam method
Open Source Code Yes All the codes are available at https://github.com/oo0mmmm/GGD.
Open Datasets Yes All data sets are available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, and are widely used in the literature (Nguyen et al., 2017; Qian et al., 2019; Sebbouh et al., 2019; Mishchenko et al., 2020; Huang et al., 2021; Malinovsky et al., 2021). [...] We use two common data sets, MNIST (LeCun et al., 2010) and CIFAR-10 (Krizhevsky et al., 2009), to train two convolutional neural networks with different structures. [...] Online News Popularity (ONR for short) data set (Fernandes and Sernadela, 2015)
Dataset Splits Yes For ijcnn1, a9a and rcv1, we use the predefined training set and testing set. covtype does not have a testing set. In that case, we randomly split the data set into the training set and the testing set with 50% for training and 50% for testing. [...] Since ONR does not have a testing set, we randomly split it into the training set and testing set with 80% for training and 20% for testing.
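The random 50/50 (covtype) and 80/20 (ONR) splits quoted above can be sketched as follows; the function name and seed are illustrative and not taken from the paper's code:

```python
import numpy as np

def random_split(n, test_frac, seed=0):
    """Randomly partition indices 0..n-1 into disjoint train/test index sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_test = int(round(n * test_frac))
    return perm[n_test:], perm[:n_test]  # (train indices, test indices)

# covtype has no predefined test set: 50% train / 50% test
train_idx, test_idx = random_split(581_012, 0.5)
# ONR likewise: 80% train / 20% test
onr_train, onr_test = random_split(39_644, 0.2)
```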
Hardware Specification Yes All the experiments are carried out on a personal computer (3.70 GHz 12th Gen Intel Core i5 with 16 GB RAM and NVIDIA RTX 3080).
Software Dependencies No The paper mentions 'most machine learning libraries calculate partial derivatives through backpropagation and the chain rule', but does not specify any particular software, frameworks, or libraries with version numbers. Therefore, there is no reproducible description of ancillary software.
Experiment Setup Yes The penalty parameter λ is set to 1/n_tr for all the experiments on different data sets. [...] we adopt the popular t-inverse learning schedule γ_k = γ_0 (1 + γ_d k/n_tr)^(-1) [...] For the methods which use grafting gradients to update the parameters in one iteration, unless otherwise specified, the size of the subsampled set m is set to 16 for ijcnn1, a9a and rcv1 and to 256 for covtype [...] The batch size b used in the grafting-gradient-based methods is set to 2 [...] Good default settings for the hyperparameters are b = 2, m = 2^k for k ∈ N^+, γ = 0.001, β1 = 0.9, β2 = 0.999 and σ = 10^(-8) [...] L2 regularization is used to prevent overfitting in these experiments and the penalty parameter λ is set to 10^(-4). Features in the data sets are normalized to the interval [0, 1]. [...] The final learning rate is 10^(-5) and the initial learning rate is 10^(-2) for MBSGD and GGD. The final learning rate is 10^(-6) and the initial learning rate is 10^(-4) for Adam and GAdam. As suggested in Section 5, the final learning rate is 1/n_tr^(2/3) and the initial learning rate is 100/n_tr^(2/3) for the SVRG and GSVRG methods. The batch size b for grafting-gradient-based methods is set to 2 and the update period q for the variance reduction methods is set to 3 n_tr/m.
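The t-inverse learning schedule quoted above can be sketched in Python. The helper that picks γ_d so the schedule lands on a stated final learning rate is a hypothetical reconstruction of how the paper's initial/final rate pairs might be matched, not code from the authors:

```python
def t_inverse_lr(gamma0, gamma_d, k, n_tr):
    # t-inverse schedule: gamma_k = gamma0 * (1 + gamma_d * k / n_tr)^(-1)
    return gamma0 / (1.0 + gamma_d * k / n_tr)

def solve_gamma_d(gamma0, gamma_final, total_steps, n_tr):
    # Hypothetical helper: choose gamma_d so that the rate at the last
    # step equals gamma_final, given the initial rate gamma0.
    return (gamma0 / gamma_final - 1.0) * n_tr / total_steps

# e.g. the MBSGD/GGD settings quoted above: initial 1e-2, final 1e-5
# (total_steps and n_tr here are illustrative values)
gd = solve_gamma_d(1e-2, 1e-5, total_steps=10_000, n_tr=1_000)
lr_first = t_inverse_lr(1e-2, gd, 0, 1_000)       # equals the initial rate
lr_last = t_inverse_lr(1e-2, gd, 10_000, 1_000)   # equals the final rate
```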