Gradient descent with generalized Newton’s method
Authors: Zhiqi Bu, Shiyun Xu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present extensive experiments on language and vision tasks (e.g. GPT and ResNet) to showcase that GeN optimizers match the state-of-the-art performance, which was achieved with carefully tuned learning rate schedulers. |
| Researcher Affiliation | Collaboration | Zhiqi Bu EMAIL Shiyun Xu University of Pennsylvania EMAIL |
| Pseudocode | Yes | Algorithm 1 Generalized Newton's optimizers (GeN), e.g. γ = 0.9, Φ = 8 |
| Open Source Code | Yes | Equal contribution. Code available at https://github.com/ShiyunXu/AutoGeN. |
| Open Datasets | Yes | We train CIFAR10 (Krizhevsky et al., 2009) on ResNet 18, 34, 50, 152 (He et al., 2016) and ViT tiny, small, base and large (Dosovitskiy et al., 2020). For fine-tuning, we use the pretrained models from the PyTorch Image Models framework (Wightman, 2019). |
| Dataset Splits | Yes | We train CIFAR10 (Krizhevsky et al., 2009) on ResNet 18, 34, 50, 152 (He et al., 2016) and ViT tiny, small, base and large (Dosovitskiy et al., 2020)... CIFAR10 and CIFAR100 are standard tiny image datasets that we have used as the test-bed... We evaluate RoBERTa-base (Liu et al., 2019) on the GLUE (Wang et al., 2019) benchmark with LoRA, BitFit and full-parameter training (FT). |
| Hardware Specification | No | Our default setting is full-parameter training (including mixed precision training), Φ = 1, and on single GPU (no communication cost among devices). |
| Software Dependencies | No | For fine-tuning, we use the pretrained models from the PyTorch Image Models framework (Wightman, 2019). ... following the official PyTorch tutorial |
| Experiment Setup | Yes | Our default hyperparameters for Figure 1, Figure 2, Figure 3, Figure 4, Figure 5 are: B = 500, Φ = 4, SGD learning rate 1e-2, AdamW learning rate 1e-4, unless one of the hyperparameters is varied for the ablation study. ... In Figure 1, Figure 2, Figure 9 and Table 3, we follow the codebase of Hu et al. and use B = 256, sequence length 128, η0 = 1e-3, and 5 epochs. While applying, we set Φ = 4. ... Per-task fine-tuning settings (batch size / initial learning rate for FT / # of epochs / eval metric): MRPC: 128 / 2e-5 / 10 / F1; SST2: 128 / 1e-6 / 10 / acc.; MNLI: 128 / 1e-6 / 5 (1 for FT) / matched acc. & mismatched acc.; CoLA: 128 / 2e-5 / 10 / Matthews corr.; QNLI: 128 / 2e-5 / 10 / acc.; QQP: 256 / 2e-5 / 5 / F1; RTE: 128 / 2e-5 / 60 / acc. |
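The Algorithm 1 cited in the Pseudocode row (GeN) replaces a hand-tuned learning-rate schedule with a step size derived from local second-order information along the update direction. A minimal sketch of that idea, assuming the step size comes from a quadratic fit of the loss at a few probe points; the function name, probe size, and fallback rule here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def gen_step_size(loss_fn, w, direction, probe_eta=0.1):
    """Sketch of a GeN-style step size (hypothetical helper, not from the
    authors' codebase): fit L(w - eta*direction) ~ a*eta^2 + b*eta + c
    from three probe evaluations and return the quadratic's minimizer."""
    etas = np.array([-probe_eta, 0.0, probe_eta])
    losses = np.array([loss_fn(w - e * direction) for e in etas])
    a, b, _c = np.polyfit(etas, losses, 2)  # exact fit through 3 points
    if a <= 0:            # loss not locally convex along this direction
        return probe_eta  # assumed fallback: just take the probe step
    return max(-b / (2 * a), 0.0)  # argmin of the fitted quadratic

# Toy usage on L(w) = 0.5 * ||w||^2 with the gradient direction d = w:
# the quadratic fit is exact, so the recovered step is the full Newton step.
w = np.array([3.0, -4.0])
eta = gen_step_size(lambda v: 0.5 * v @ v, w, direction=w)
w_new = w - eta * w
```

On this toy quadratic the fitted step lands exactly on the minimizer (eta = 1, w_new = 0); on a real network the fit is only local, which is why the paper's Algorithm 1 additionally smooths the estimate (e.g. γ = 0.9) and re-estimates only every Φ steps.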