Subspace Optimization for Large Language Models with Convergence Guarantees

Authors: Yutong He, Pengrui Li, Yipeng Hu, Chuyan Chen, Kun Yuan

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we empirically validate our theoretical results and thoroughly test the proposed mechanisms. Codes are available at https://github.com/pkumelon/Golore. (§6, Experiments) We evaluate GaLore and GoLore on several different tasks, including a counter-example problem (1), pre-training and fine-tuning LLMs with real benchmarks. Throughout our experiments, GoLore@x% uses GaLore in the first (100−x)% iterations and GoLore in the last x% iterations, L.B. GaLore denotes large-batch GaLore, and Full Params. denotes full-parameter training.
Researcher Affiliation | Academia | ¹Peking University, ²Zhongguancun Academy, ³Beihang University, ⁴AI for Science Institute, Beijing, China, ⁵National Engineering Laboratory for Big Data Analytics and Applications. Correspondence to: Kun Yuan <EMAIL>.
Pseudocode | Yes | Algorithm 1: GaLore / GoLore algorithm framework using stochastic / deterministic / large-batch gradients
Open Source Code | Yes | Codes are available at https://github.com/pkumelon/Golore.
Open Datasets | Yes | We pre-trained LLaMA-60M on the C4 (Raffel et al., 2020) dataset for 30,000 iterations using various algorithms... fine-tuned pre-trained RoBERTa models (Liu, 2019) on the GLUE benchmark (Wang, 2018)... LLaMA2-7B models (Touvron et al., 2023) on the WinoGrande dataset (Sakaguchi et al., 2021), and OPT-13B models (Zhang et al., 2022) on the BoolQ dataset (Clark et al., 2019).
Dataset Splits | No | Pre-training tasks on C4 dataset. We pre-trained LLaMA-60M on the C4 dataset for 30,000 iterations... Fine-tuning tasks on WinoGrande dataset. We fine-tune pre-trained LLaMA2-7B model on the WinoGrande dataset for 30 epochs... Fine-tuning tasks on BoolQ dataset. We fine-tune pre-trained LLaMA2-7B model on the BoolQ dataset on 4 NVIDIA A100 80G GPUs. ... We further fine-tune pre-trained OPT-13B for 1 epoch... The text describes the duration of fine-tuning and number of iterations but does not explicitly state dataset splits (e.g., percentages for train/val/test).
Hardware Specification | Yes | enabling the pre-training of a 7B model on an NVIDIA RTX 4090 with 24GB of memory. Pre-training tasks on C4 dataset. We pre-trained LLaMA-60M on the C4 (Raffel et al., 2020) dataset for 30,000 iterations on 4 NVIDIA A100 40G GPUs. Fine-tuning tasks on WinoGrande dataset. ... on 4 NVIDIA A100 80G GPUs. Fine-tuning tasks on GLUE benchmark. We fine-tune pre-trained RoBERTa-Base model on the GLUE benchmark for 30 epochs on a single GeForce RTX 4090.
Software Dependencies | No | All implementations utilized the AdamW optimizer in BF16 format. We use MSGD as the subspace optimizer... The paper names specific optimizers and a number format (BF16) but does not provide version numbers for any software components.
Experiment Setup | Yes | Pre-training tasks on C4 dataset. We use batch size 128, learning rate 1.0e-3, rank 128, scaling factor α = 1, subspace changing frequency τ = 200, and a max sequence length of 256. Table 5: Hyperparameters used in fine-tuning pre-trained RoBERTa-Base model on the GLUE benchmark.
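The GoLore@x% protocol quoted in the Research Type row (GaLore for the first (100−x)% of iterations, GoLore for the last x%) can be written down as a simple schedule. This is a minimal sketch of that switching rule only; the function name, return representation, and everything else here are illustrative and not taken from the authors' Golore repository.

```python
def golore_schedule(total_iters: int, x_percent: float) -> list:
    """Sketch of the GoLore@x% switching rule: GaLore for the first
    (100 - x)% of iterations, GoLore for the remaining x%.
    Names are hypothetical, not from the paper's codebase."""
    switch_at = int(total_iters * (100 - x_percent) / 100)
    return ["GaLore" if t < switch_at else "GoLore"
            for t in range(total_iters)]

# GoLore@20% over 10 iterations: 8 GaLore steps, then 2 GoLore steps
plan = golore_schedule(10, 20)
```

Note that GoLore@100% never runs GaLore, and GoLore@0% never switches; the paper's "L.B. GaLore" (large-batch GaLore) variant is a separate configuration, not part of this schedule.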
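The pre-training hyperparameters quoted in the Experiment Setup row can be collected into a single configuration for reproduction attempts. The values below are exactly those reported for the C4 / LLaMA-60M run; the dictionary key names are hypothetical and do not come from the authors' code.

```python
# Hyperparameters reported for pre-training LLaMA-60M on C4;
# key names are illustrative, values are from the paper's quoted setup.
pretrain_config = {
    "dataset": "C4",
    "model": "LLaMA-60M",
    "iterations": 30_000,
    "batch_size": 128,
    "learning_rate": 1.0e-3,
    "rank": 128,                      # low-rank subspace dimension
    "scaling_factor_alpha": 1,        # α
    "subspace_change_freq_tau": 200,  # τ, iterations between subspace updates
    "max_seq_length": 256,
}
```

The GLUE fine-tuning hyperparameters live in the paper's Table 5 and are not reproduced here.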