AdaGrad under Anisotropic Smoothness
Authors: Yuxing Liu, Rui Pan, Tong Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments in logistic regression and instruction following fine-tuning tasks provide strong evidence to support our novel assumption and theoretical analysis. ... 6 EXPERIMENTAL RESULTS ... We utilize real-world datasets a4a, a6a, a9a, real-sim and rcv1.binary from libsvm (Chang & Lin, 2011) ... For nonconvex cases, we check the instruction-following fine-tuning task on Alpaca (Taori et al., 2023) dataset with GPT-2 (Radford et al., 2019) model. |
| Researcher Affiliation | Academia | Yuxing Liu Rui Pan Tong Zhang University of Illinois Urbana-Champaign EMAIL |
| Pseudocode | Yes | Algorithm 1 Adagrad |
| Open Source Code | No | The paper discusses the use of a third-party library: 'In all our implementations, we use the version transformers==4.38.2.' and mentions its license. However, there is no explicit statement or link indicating that the authors' own implementation code for the methodology described in this paper is open-sourced. |
| Open Datasets | Yes | We utilize real-world datasets a4a, a6a, a9a, real-sim and rcv1.binary from libsvm (Chang & Lin, 2011) ... For nonconvex cases, we check the instruction-following fine-tuning task on Alpaca (Taori et al., 2023) dataset with GPT-2 (Radford et al., 2019) model. ... Regarding licenses, the Alpaca dataset is released under Creative Commons Attribution Non Commercial 4.0 International Public License (https://github.com/tatsu-lab/stanford_alpaca/blob/main/DATA_LICENSE) |
| Dataset Splits | No | The paper mentions using specific datasets (libsvm datasets, Alpaca) and refers to 'fine-tuning tasks', but it does not explicitly provide details about the specific training, validation, or test splits used for its experiments. For example, it does not specify percentages or sample counts for how these datasets were divided for their experiments. |
| Hardware Specification | Yes | All experiments are conducted on a single A40 GPU, where gradient accumulation is adopted for batch sizes larger than 128 to reduce memory cost. |
| Software Dependencies | Yes | In all our implementations, we use the version transformers==4.38.2. |
| Experiment Setup | Yes | grid searches are conducted for both algorithms, with the search space being initial learning rate η ∈ {10.0, 1.0, 0.1, 0.01} and learning rate schedules being either constant ηt = η or inverse square root decay ηt = η/√(t+1) ... For all experiments, we run 3 epochs of optimization with SGD and Adagrad... We search the learning rate η ∈ {1.0, 10⁻¹, 10⁻², 10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶} ... The maximum sequence length is set to 512, along with the learning rate schedule being set to cosine decay (Loshchilov & Hutter, 2016). |
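The pseudocode row above refers to the paper's Algorithm 1 (Adagrad). As context for readers, here is a minimal NumPy sketch of the textbook diagonal-AdaGrad update; this is an illustration of the standard method, not the authors' implementation, and `eta`/`eps` values are arbitrary:

```python
import numpy as np

def adagrad_step(x, grad, accum, eta=1.0, eps=1e-8):
    """One diagonal-AdaGrad step: accumulate squared gradients,
    then scale the step size per coordinate by 1/sqrt(accumulator)."""
    accum = accum + grad ** 2
    x = x - eta * grad / (np.sqrt(accum) + eps)
    return x, accum

# Toy run: minimize f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x = np.array([1.0, -2.0])
accum = np.zeros_like(x)
for _ in range(500):
    x, accum = adagrad_step(x, x, accum, eta=1.0)
```

On this toy quadratic the iterate shrinks toward the origin, with each coordinate adapting its own effective learning rate.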
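The reported search space (initial learning rates crossed with two schedules) can be enumerated directly; the sketch below mirrors the quoted grid for the logistic-regression experiments, with the function and variable names being illustrative, not taken from the paper's code:

```python
import itertools

# Search space quoted in the paper for the convex experiments.
etas = [10.0, 1.0, 0.1, 0.01]

def constant(eta, t):
    # ηt = η: the learning rate never decays.
    return eta

def inv_sqrt(eta, t):
    # ηt = η / sqrt(t + 1): inverse square root decay.
    return eta / (t + 1) ** 0.5

schedules = {"constant": constant, "inv_sqrt": inv_sqrt}

# 4 learning rates x 2 schedules = 8 configurations to sweep.
grid = list(itertools.product(etas, schedules))
```

Each `(eta, schedule_name)` pair in `grid` would then be run for the 3 epochs mentioned in the setup, keeping the best configuration per dataset.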