AdaGrad under Anisotropic Smoothness

Authors: Yuxing Liu, Rui Pan, Tong Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments in logistic regression and instruction-following fine-tuning tasks provide strong evidence to support our novel assumption and theoretical analysis. ... 6 EXPERIMENTAL RESULTS ... We utilize real-world datasets a4a, a6a, a9a, real-sim and rcv1.binary from libsvm (Chang & Lin, 2011) ... For nonconvex cases, we check the instruction-following fine-tuning task on the Alpaca (Taori et al., 2023) dataset with the GPT-2 (Radford et al., 2019) model.
Researcher Affiliation | Academia | Yuxing Liu, Rui Pan, Tong Zhang; University of Illinois Urbana-Champaign
Pseudocode | Yes | Algorithm 1 Adagrad
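For context on the pseudocode the review refers to (Algorithm 1, Adagrad), here is a minimal sketch of the standard AdaGrad update: each coordinate's step is scaled by the root of its accumulated squared gradients. This is the textbook form of the algorithm, not the paper's exact Algorithm 1; the function name and test problem are illustrative.

```python
import numpy as np

def adagrad_step(w, grad, state_sum, lr=0.1, eps=1e-8):
    """One AdaGrad update (textbook form, not the paper's code).

    Accumulates squared gradients per coordinate, then takes a
    per-coordinate step scaled by 1/sqrt(accumulated sum).
    """
    state_sum = state_sum + grad ** 2
    w = w - lr * grad / (np.sqrt(state_sum) + eps)
    return w, state_sum

# Toy usage: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
s = np.zeros_like(w)
for _ in range(200):
    w, s = adagrad_step(w, w.copy(), s, lr=0.5)
# w is driven close to the minimizer at the origin
```

Per-coordinate scaling is exactly what makes AdaGrad attractive under anisotropic smoothness: directions with consistently large gradients get smaller effective step sizes.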
Open Source Code | No | The paper discusses the use of a third-party library: 'In all our implementations, we use the version transformers==4.38.2.' and mentions its license. However, there is no explicit statement or link indicating that the authors' own implementation code for the methodology described in this paper is open-sourced.
Open Datasets | Yes | We utilize real-world datasets a4a, a6a, a9a, real-sim and rcv1.binary from libsvm (Chang & Lin, 2011) ... For nonconvex cases, we check the instruction-following fine-tuning task on the Alpaca (Taori et al., 2023) dataset with the GPT-2 (Radford et al., 2019) model. ... Regarding licenses, the Alpaca dataset is released under the Creative Commons Attribution Non Commercial 4.0 International Public License (https://github.com/tatsu-lab/stanford_alpaca/blob/main/DATA_LICENSE)
Dataset Splits | No | The paper mentions using specific datasets (libsvm datasets, Alpaca) and refers to 'fine-tuning tasks', but it does not explicitly provide details about the training, validation, or test splits used for its experiments. For example, it does not specify percentages or sample counts for how these datasets were divided.
Hardware Specification | Yes | All experiments are conducted on a single A40 GPU, where gradient accumulation is adopted for batch sizes larger than 128 to reduce memory cost.
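The gradient accumulation mentioned above (used for effective batch sizes larger than 128 on one GPU) can be sketched as follows. This is a generic illustration with a hypothetical `grad_fn` helper, not the authors' implementation: the gradient of a large batch is computed as the size-weighted average of micro-batch gradients, so only one micro-batch is in memory at a time.

```python
import numpy as np

def accumulated_grad(grad_fn, data, micro_batch=128):
    """Average gradients over micro-batches (hypothetical helper).

    grad_fn(chunk) returns the mean gradient over a chunk; weighting
    each result by the chunk length recovers the full-batch mean
    gradient without ever materializing the full batch.
    """
    total, n = None, 0
    for start in range(0, len(data), micro_batch):
        chunk = data[start:start + micro_batch]
        g = grad_fn(chunk) * len(chunk)  # undo the per-chunk mean
        total = g if total is None else total + g
        n += len(chunk)
    return total / n

# Usage with a toy "gradient": the mean of the chunk values.
data = np.arange(10, dtype=float)
full_batch = accumulated_grad(lambda c: c.mean(), data, micro_batch=3)
```

Because the micro-batch means are re-weighted by chunk size, the result matches the single large-batch gradient exactly (up to floating point).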
Software Dependencies | Yes | In all our implementations, we use the version transformers==4.38.2.
Experiment Setup | Yes | grid searches are conducted for both algorithms, with the search space being initial learning rate η ∈ {10.0, 1.0, 0.1, 0.01} and learning rate schedules being either constant ηt = η or inverse square root decay ηt = η/√(t+1) ... For all experiments, we run 3 epochs of optimization with SGD and Adagrad... We search the learning rate η ∈ {1.0, 10⁻¹, 10⁻², 10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶} ... The maximum sequence length is set to 512, along with the learning rate schedule being set to cosine decay (Loshchilov & Hutter, 2016).
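The three learning-rate schedules named in the setup (constant, inverse square root decay, and cosine decay) can be written down directly from their formulas. A minimal sketch, assuming a horizon `T` for cosine decay; the factory-function style is a presentation choice, not the paper's code.

```python
import math

def constant(eta):
    # eta_t = eta for every step t
    return lambda t: eta

def inv_sqrt_decay(eta):
    # eta_t = eta / sqrt(t + 1), as in the paper's search space
    return lambda t: eta / math.sqrt(t + 1)

def cosine_decay(eta, T):
    # eta_t = eta * (1 + cos(pi * t / T)) / 2 (Loshchilov & Hutter, 2016),
    # decaying from eta at t=0 to 0 at t=T
    return lambda t: eta * 0.5 * (1.0 + math.cos(math.pi * t / T))

sched = inv_sqrt_decay(1.0)
lr_at_3 = sched(3)  # 1 / sqrt(4) = 0.5
```

A grid search as described then just loops over η ∈ {10.0, 1.0, 0.1, 0.01} and over these schedule constructors, running the optimizer once per (η, schedule) pair.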