SMT: Fine-Tuning Large Language Models with Sparse Matrices

Authors: Haoze He, Juncheng Li, Xuan Jiang, Heather Miller

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we demonstrated that SMT consistently surpasses other PEFT baselines (e.g., LoRA and DoRA) in fine-tuning popular large language models such as LLaMA across a broad spectrum of tasks, while reducing the GPU memory footprint by 67% compared to FT. We also examine how the performance of LoRA and DoRA tends to plateau and decline as the number of trainable parameters increases; in contrast, our SMT method does not suffer from such issues.
Researcher Affiliation | Collaboration | 1Carnegie Mellon University, 2Two Sigma Investments, 3University of California, Berkeley EMAIL, {xuanjiang}@berkeley.edu
Pseudocode | Yes | SMT implements a custom sparse linear layer to ensure that unselected gradients are not calculated, saved, or updated (Code Snippet 6). We replace the selected linear layers with these customized sparse linear layers. The custom sparse linear layer applies a specialized sparse linear multiplication function, integrated into our customized sparse linear layers (Code Snippet 7). This function calculates partial weight gradients based on the input, weight, and selected weight index. ... In the forward pass (Code Snippet 7) of the sparse linear multiplication function, we only save the selected activation x using ctx.save_for_backward(), and in the backward pass (Code Snippet 8), we customize the matrix multiplication to calculate the needed partial gradients given the partial input and gradient index (shown in Fig. 2(b)).
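The quoted mechanism (a backward pass that computes weight gradients only for selected indices, so unselected gradients are never materialized) can be sketched in pure Python. This is a minimal illustration, not the paper's implementation: the actual code is a custom PyTorch autograd function, and the helper names `forward`/`backward_sparse` here are hypothetical.

```python
# Minimal sketch of sparse linear multiplication: a dense forward pass
# paired with a backward pass that computes dL/dW only for selected
# weight rows. Unselected rows are skipped entirely, so their gradients
# are never computed or stored -- the memory saving the quote describes.
# (Pure-Python lists stand in for tensors; names are illustrative.)

def forward(x, W):
    """Dense forward pass: y = x @ W, with x of shape (m, n) and W of (n, p)."""
    m, n, p = len(x), len(W), len(W[0])
    return [[sum(x[i][k] * W[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]

def backward_sparse(x, grad_y, selected_rows):
    """Partial weight gradient: dL/dW[k] = x[:, k]^T @ grad_y,
    evaluated only for k in selected_rows."""
    m, p = len(grad_y), len(grad_y[0])
    grad_W = {}
    for k in selected_rows:  # only the selected sub-matrix indices
        grad_W[k] = [sum(x[i][k] * grad_y[i][j] for i in range(m))
                     for j in range(p)]
    return grad_W
```

In the real layer, the forward pass would also save only the selected slice of the activation x for the backward pass, which is where the further activation-memory saving comes from.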
Open Source Code | Yes | Our implementation is open source: https://github.com/HectorHHZ/Sparse_Matrix_Tuning/
Open Datasets | Yes | In Subsection 4.3, we perform fine-tuning on the Math10K (Hu et al., 2023) dataset, which includes the MultiArith, GSM8K (Cobbe et al., 2021), AddSub, AQuA, SingleEq, and SVAMP datasets, and evaluate the efficiency on their test sets.
Dataset Splits | Yes | In Subsections 4.1 and 4.2, we perform fine-tuning on the Commonsense Reasoning tasks with 8 sub-tasks, each with a predefined training and testing set. We follow the setting of (Hu et al., 2023; Liu et al., 2024a) and amalgamate the training datasets from all 8 tasks to create the final training dataset Commonsense170K, and conduct evaluations on the individual testing dataset for each task.
Hardware Specification | Yes | We conduct our experiments and implement the SOTA baselines LoRA (Microsoft) and DoRA (Shih-yang) to fine-tune the LLaMA-7B and LLaMA2-7B models with 4 NVIDIA A100 40GB GPUs, and fine-tune the LLaMA-13B and LLaMA3-8B models with 4 NVIDIA A100 80GB GPUs. Communication between the CPU and GPU is facilitated via PCIe Gen4, and communication between GPUs is facilitated via NVLink 3.
Software Dependencies | No | We used the DeepSpeed (Aminabadi et al., 2022) library for fine-tuning and the Accelerate (Gugger et al., 2022) library for inference evaluation. Both training and fine-tuning use dtype bf16. The paper mentions the DeepSpeed and Accelerate libraries but does not provide specific version numbers for them.
Experiment Setup | Yes | Both training and fine-tuning use dtype bf16. All experiments are fine-tuned for 3 epochs. In all our experiments in Section 4, sub-matrices are selected in blocks of size l = 256. ... We apply 100 warm-up iterations to all SMT experiments on the Commonsense dataset and 25 warm-up iterations to all SMT experiments on the Math10K dataset. The number of warm-up iterations is tuned for each dataset. ... Table 1: The experiments involved Full Fine-Tuning, SMT, LoRA, DoRA, and SpIEL on 4 A100 40GB GPUs using data parallelism, with a batch size of 16.
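The setup above mentions selecting sub-matrices in l x l blocks after a warm-up phase. A small sketch of what such block selection could look like, under the assumption (not stated in the quoted text) that blocks are ranked by the magnitude of gradients accumulated during warm-up; the function name and criterion are illustrative, not the paper's code:

```python
# Hypothetical block selection after warm-up: partition a weight matrix
# into l x l blocks and keep the k blocks with the largest summed
# absolute accumulated gradient. The quoted text only fixes the block
# size (l = 256); the ranking criterion here is an assumption.

def select_blocks(grad_accum, l, k):
    """grad_accum: 2-D list of accumulated gradients for one weight matrix.
    Returns (row_block, col_block) indices of the top-k blocks."""
    rows, cols = len(grad_accum), len(grad_accum[0])
    scores = []
    for bi in range(0, rows, l):
        for bj in range(0, cols, l):
            # Sum |gradient| over this l x l block (clipped at the edges).
            s = sum(abs(grad_accum[i][j])
                    for i in range(bi, min(bi + l, rows))
                    for j in range(bj, min(bj + l, cols)))
            scores.append((s, bi // l, bj // l))
    scores.sort(reverse=True)
    return [(bi, bj) for _, bi, bj in scores[:k]]
```

Only the selected blocks would then be made trainable (e.g., via the sparse linear layer described earlier), while all other weights stay frozen.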