Subspace Optimization for Large Language Models with Convergence Guarantees
Authors: Yutong He, Pengrui Li, Yipeng Hu, Chuyan Chen, Kun Yuan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we empirically validate our theoretical results and thoroughly test the proposed mechanisms. Codes are available at https://github.com/pkumelon/Golore. (Section 6, Experiments:) We evaluate GaLore and GoLore on several different tasks, including a counter-example problem (1), and pre-training and fine-tuning LLMs with real benchmarks. Throughout our experiments, GoLore@x% uses GaLore in the first (100−x)% iterations and GoLore in the last x% iterations, L.B. GaLore denotes large-batch GaLore, and Full Params. denotes full-parameter training. |
| Researcher Affiliation | Academia | 1Peking University 2Zhongguancun Academy 3Beihang University 4AI for Science Institute, Beijing, China 5National Engineering Laboratory for Big Data Analytics and Applications. Correspondence to: Kun Yuan <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 GaLore / GoLore algorithm framework using stochastic / deterministic / large-batch gradients |
| Open Source Code | Yes | Codes are available at https://github.com/pkumelon/Golore. |
| Open Datasets | Yes | We pre-trained LLaMA-60M on the C4 (Raffel et al., 2020) dataset for 30,000 iterations using various algorithms... fine-tuned pre-trained RoBERTa models (Liu, 2019) on the GLUE benchmark (Wang, 2018)... LLaMA2-7B models (Touvron et al., 2023) on the WinoGrande dataset (Sakaguchi et al., 2021), and OPT-13B models (Zhang et al., 2022) on the BoolQ dataset (Clark et al., 2019). |
| Dataset Splits | No | Pre-training tasks on C4 dataset. We pre-trained LLaMA-60M on C4 dataset for 30,000 iterations... Fine-tuning tasks on WinoGrande dataset. We fine-tune pre-trained LLaMA2-7B model on the WinoGrande dataset for 30 epochs... Fine-tuning tasks on BoolQ dataset. We fine-tune pre-trained LLaMA2-7B model on the BoolQ dataset on 4 NVIDIA A100 80G GPUs. ... We further fine-tune pre-trained OPT-13B for 1 epoch... The text describes the duration of fine-tuning and number of iterations but does not explicitly state dataset splits (e.g., percentages for train/val/test). |
| Hardware Specification | Yes | enabling the pre-training of a 7B model on an NVIDIA RTX 4090 with 24GB of memory. Pre-training tasks on C4 dataset. We pre-trained LLaMA-60M on the C4 (Raffel et al., 2020) dataset for 30,000 iterations on 4 NVIDIA A100 40G GPUs. Fine-tuning tasks on WinoGrande dataset. ... on 4 NVIDIA A100 80G GPUs. Fine-tuning tasks on GLUE benchmark. We fine-tune pre-trained RoBERTa-Base model on the GLUE benchmark for 30 epochs on a single GeForce RTX 4090. |
| Software Dependencies | No | All implementations utilized the AdamW optimizer in BF16 format. We use MSGD as the subspace optimizer... The paper mentions specific optimizers and a format but does not provide specific version numbers for any software components. |
| Experiment Setup | Yes | Pre-training tasks on C4 dataset. We use batch size 128, learning rate 1.0e-3, rank 128, scaling factor α = 1, subspace changing frequency τ = 200, and a max sequence length of 256. Table 5. Hyperparameters used in fine-tuning pre-trained RoBERTa-Base model on the GLUE benchmark. |
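The GoLore@x% schedule quoted above (GaLore for the first (100−x)% of iterations, GoLore for the last x%) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name and the string labels are hypothetical, and only the switch-point arithmetic and the paper's reported pre-training settings (30,000 iterations, τ = 200) are taken from the table.

```python
# Hypothetical sketch of the GoLore@x% schedule: GaLore (SVD-based
# projection) for the first (100 - x)% of iterations, GoLore (random
# projection) for the last x%. Names are illustrative, not the paper's API.

def projection_type(step: int, total_steps: int, x_percent: float) -> str:
    """Return which low-rank projection to use at the given iteration."""
    switch_step = int(total_steps * (100 - x_percent) / 100)
    return "galore_svd" if step < switch_step else "golore_random"

# Using the reported pre-training setup: 30,000 iterations, subspace
# refreshed every tau = 200 steps; x = 20 is an arbitrary example value.
total_steps, tau, x = 30_000, 200, 20.0
schedule = [
    projection_type(t, total_steps, x)
    for t in range(0, total_steps, tau)  # decide only at refresh steps
]
```

With x = 20, the switch lands at iteration 24,000, so the final 30 of the 150 subspace-refresh steps draw a random projection instead of recomputing an SVD.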