Parameter and Memory Efficient Pretraining via Low-rank Riemannian Optimization

Authors: Zhanfeng Mo, Long-Kai Huang, Sinno Jialin Pan

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On LLaMA-1B, LORO achieves a perplexity score 2% better than the full-size baseline, with 54% less model memory cost, and offers a 1.8× speedup in training and a 2.2× speedup in inference. The code is available on GitHub. ... Extensive experiments demonstrate that LORO can discover competitive low-rank models with performance comparable to full-size baselines, while providing significant memory reduction and acceleration in both training and inference.
Researcher Affiliation | Collaboration | Nanyang Technological University, Singapore; Tencent AI Lab; The Chinese University of Hong Kong. EMAIL; EMAIL; EMAIL
Pseudocode | Yes | Algorithm 1: Low-rank Riemannian Optimizer. Algorithm 2: Low-rank Riemannian Optimizer (LORO) in PyTorch (Paszke et al., 2019)
Open Source Code | Yes | The code is available on GitHub: https://github.com/mzf666/LORO-main
Open Datasets | Yes | We train all the models on the C4 (Colossal Clean Crawled Corpus) dataset (Raffel et al., 2019), a large-scale cleaned dataset designed for language model pretraining. ... Following the experiment setup in (Zhao et al., 2024, Section 5.4), we extend our LORO to finetune the pretrained RoBERTa-base model (Liu et al., 2019) on GLUE datasets (Wang et al., 2019).
Dataset Splits | Yes | Following the experiment setup in (Zhao et al., 2024, Section 5.4), we extend our LORO to finetune the pretrained RoBERTa-base model (Liu et al., 2019) on GLUE datasets (Wang et al., 2019). ... Table 8: Hyperparameters of LORO in fine-tuning RoBERTa experiments. ... # Epochs
Hardware Specification | Yes | All the experiments are implemented in PyTorch (Paszke et al., 2019) and conducted on NVIDIA 40GB A100 GPUs. ... We run all the experiments on 1 NVIDIA 40GB A100 GPU
Software Dependencies | No | All the experiments are implemented in PyTorch (Paszke et al., 2019)... Adam optimizer (Kingma & Ba, 2015)... LLaMA-based language model (Touvron et al., 2023b)... Huggingface on the GLUE benchmark. The paper mentions software names but does not provide specific version numbers for key libraries such as PyTorch or Huggingface.
Experiment Setup | Yes | For all runs, we set the max data sequence length to 256 with a batch size of 512 (i.e., a token batch size of 131K). ... we employ a learning rate warmup starting from 0 during the first 10% of the pretraining steps, followed by a cosine annealing scheduler that decays to 10% of the maximum learning rate. We initialize the low-rank factors with Xavier initialization (Glorot & Bengio, 2010)... we set the LORO exact update frequency to K = 500 and the learning rate to 0.01 for the LLaMA-60M, -130M, and -350M models, while for LLaMA-1B, we set K = 200 and the learning rate to 0.005. ... Table 8: Hyperparameters of LORO in fine-tuning RoBERTa experiments.
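The pseudocode row above refers to an optimizer that trains low-rank weight factors directly. As a rough illustration of the parameterization involved (a hedged sketch, not the authors' LORO implementation; the names `lowrank_linear`, `B`, and `A` are invented for this example), a linear layer's m×n weight W is replaced by a factor pair B (m×r) and A (r×n), so only r(m+n) parameters are stored and trained instead of mn:

```python
import numpy as np

# Illustrative sketch only: a low-rank factorized linear map W ~= B @ A,
# the kind of parameterization LORO-style pretraining optimizes directly.
# All names here are invented for the sketch, not taken from the paper's code.

def lowrank_linear(x, B, A):
    """Compute x @ (B @ A).T without materializing the full m x n weight."""
    return (x @ A.T) @ B.T  # (batch, n) -> (batch, r) -> (batch, m)

m, n, r = 512, 1024, 64  # output dim, input dim, rank (r << min(m, n))
rng = np.random.default_rng(0)

# Xavier-style initialization of the low-rank factors, as the report quotes.
B = rng.normal(0.0, np.sqrt(2.0 / (m + r)), size=(m, r))
A = rng.normal(0.0, np.sqrt(2.0 / (r + n)), size=(r, n))

x = rng.normal(size=(4, n))
y = lowrank_linear(x, B, A)

full_params = m * n            # 524288 parameters for a dense weight
lowrank_params = r * (m + n)   # 98304 parameters for the factor pair
```

With these example dimensions the factor pair holds under 19% of the dense layer's parameters, which is the kind of memory saving the report cites. The actual LORO method additionally performs Riemannian update steps on the low-rank manifold, which this sketch does not show.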
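The learning-rate recipe quoted in the setup row (linear warmup from 0 over the first 10% of steps, then cosine annealing down to 10% of the maximum rate) can be sketched as a small schedule function. This is a minimal illustration under the stated settings; `lr_at_step` and its defaults are invented for the example, not the paper's code.

```python
import math

def lr_at_step(step, total_steps, max_lr, warmup_frac=0.1, min_lr_frac=0.1):
    """Return the learning rate at a given step (0-indexed)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear warmup starting from 0.
        return max_lr * step / warmup_steps
    # Cosine annealing from max_lr down to min_lr_frac * max_lr.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    min_lr = max_lr * min_lr_frac
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For example, with `total_steps=1000` and `max_lr=0.01` (the rate the report quotes for the 60M–350M models), the schedule starts at 0, peaks at 0.01 after step 100, and decays toward 0.001 by the final step.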