Gradient descent with generalized Newton’s method

Authors: Zhiqi Bu, Shiyun Xu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We present extensive experiments on language and vision tasks (e.g. GPT and ResNet) to showcase that GeN optimizers match the state-of-the-art performance, which was achieved with carefully tuned learning rate schedulers."
Researcher Affiliation | Collaboration | "Zhiqi Bu EMAIL Shiyun Xu University of Pennsylvania EMAIL"
Pseudocode | Yes | "Algorithm 1 Generalized Newton's optimizers (GeN), e.g. γ = 0.9, Φ = 8"
Open Source Code | Yes | "Equal contribution. Code available at https://github.com/ShiyunXu/AutoGeN."
Open Datasets | Yes | "We train CIFAR10 (Krizhevsky et al., 2009) on ResNet 18, 34, 50, 152 (He et al., 2016) and ViT tiny, small, base and large (Dosovitskiy et al., 2020). For fine-tuning, we use the pretrained models from the PyTorch Image Models framework (Wightman, 2019)."
Dataset Splits | Yes | "We train CIFAR10 (Krizhevsky et al., 2009) on ResNet 18, 34, 50, 152 (He et al., 2016) and ViT tiny, small, base and large (Dosovitskiy et al., 2020)... CIFAR10 and CIFAR100 are standard tiny image datasets that we have used as the test-bed... We evaluate RoBERTa-base (Liu et al., 2019) on the GLUE (Wang et al., 2019) benchmark with LoRA, BitFit and full-parameter training (FT)."
Hardware Specification | No | "Our default setting is full-parameter training (including mixed precision training), Φ = 1, and on single GPU (no communication cost among devices)."
Software Dependencies | No | "For fine-tuning, we use the pretrained models from the PyTorch Image Models framework (Wightman, 2019). ... following the official PyTorch tutorial"
Experiment Setup | Yes | "Our default hyperparameters for Figure 1, Figure 2, Figure 3, Figure 4, Figure 5 are: B = 500, Φ = 4, SGD learning rate = 1e-2, AdamW learning rate = 1e-4, unless one of the hyperparameters is varied for the ablation study. ... In Figure 1, Figure 2, Figure 9 and Table 3, we follow the codebase of Hu et al. and use B = 256, sequence length 128, η0 = 1e-3, and 5 epochs. While applying, we set Φ = 4."

GLUE hyperparameters (reconstructed from the flattened listing):

Task | Batch size | Initial learning rate for FT | # of epochs | Eval metrics
MRPC | 128 | 2e-5 | 10 | F1
SST2 | 128 | 1e-6 | 10 | acc.
MNLI | 128 | 1e-6 | 5 (1 for FT) | matched acc. & mismatched acc.
CoLA | 128 | 2e-5 | 10 | Matthews corr.
QNLI | 128 | 2e-5 | 10 | acc.
QQP | 256 | 2e-5 | 5 | F1
RTE | 128 | 2e-5 | 60 | acc.
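The learning-rate rule behind Algorithm 1 above can be illustrated with a minimal sketch: a generalized Newton step fits a local quadratic to the loss along the update direction and steps to its minimum, η* ≈ (gᵀg)/(gᵀHg). The NumPy toy problem below (the probe step `delta`, the random quadratic loss, and all variable names are illustrative assumptions, not the authors' implementation) estimates η* from three loss evaluations via finite differences:

```python
import numpy as np

# Toy strictly convex quadratic loss L(w) = 0.5 * w^T A w with SPD Hessian A.
rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)        # symmetric positive definite Hessian
w = rng.standard_normal(d)

def loss(v):
    return 0.5 * v @ A @ v

g = A @ w                      # exact gradient of the quadratic loss at w
delta = 1e-3                   # probe step size (hypothetical choice)

# Evaluate the loss at eta = -delta, 0, +delta along the direction -g and
# fit L(eta) ~ c0 + c1*eta + 0.5*c2*eta^2 with central finite differences.
L_minus = loss(w + delta * g)  # eta = -delta
L_zero = loss(w)               # eta = 0
L_plus = loss(w - delta * g)   # eta = +delta
c1 = (L_plus - L_minus) / (2 * delta)            # ~ -g.g  (dL/d eta at 0)
c2 = (L_plus - 2 * L_zero + L_minus) / delta**2  # ~ g.A.g (curvature)
eta_star = -c1 / c2            # minimizer of the fitted parabola

# On an exact quadratic this recovers eta* = (g.g)/(g.A.g).
assert np.isclose(eta_star, (g @ g) / (g @ A @ g))
```

In the paper's setting, the quoted γ would smooth η* across iterations (an exponential moving average) and Φ would control how often the fit is recomputed (the "lazy" update frequency); both are omitted from this sketch.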