SMT: Fine-Tuning Large Language Models with Sparse Matrices

Authors: Haoze He, Juncheng Li, Xuan Jiang, Heather Miller

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we demonstrated that SMT consistently surpasses other PEFT baselines (e.g., LoRA and DoRA) in fine-tuning popular large language models such as LLaMA across a broad spectrum of tasks, while reducing the GPU memory footprint by 67% compared to FT. We also examine how the performance of LoRA and DoRA tends to plateau and decline as the number of trainable parameters increases; in contrast, our SMT method does not suffer from such issues.
Researcher Affiliation | Collaboration | 1Carnegie Mellon University, 2Two Sigma Investments, 3University of California, Berkeley EMAIL, {xuanjiang}@berkeley.edu
Pseudocode | Yes | SMT implements a custom sparse linear layer to ensure that unselected gradients are not calculated, saved, or updated (Code Snippet 6). We replace the selected linear layers with these customized sparse linear layers. The custom sparse linear layer applies a specialized sparse linear multiplication function, integrated into our customized sparse linear layers (Code Snippet 7). This function calculates partial weight gradients based on the input, weight, and selected weight index. ... In the forward pass (Code Snippet 7) of the sparse linear multiplication function, we only save the selected activation x using ctx.save_for_backward(), and in the backward pass (Code Snippet 8), we customize the matrix multiplication to calculate the needed partial gradients given the partial input and gradient index (shown in Fig. 2(b)).
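The quoted mechanism (a backward pass that computes weight gradients only for selected indices, so unselected gradients are never materialized) can be sketched in pure Python. This is a minimal illustration, not the paper's implementation: the actual code is a custom PyTorch autograd function, and the helper names `forward`/`backward_sparse` here are hypothetical.

```python
# Minimal sketch of sparse linear multiplication: a dense forward pass
# paired with a backward pass that computes dL/dW only for selected
# weight rows. Unselected rows are skipped entirely, so their gradients
# are never computed or stored -- the memory saving the quote describes.
# (Pure-Python lists stand in for tensors; names are illustrative.)

def forward(x, W):
    """Dense forward pass: y = x @ W, with x of shape (m, n) and W of (n, p)."""
    m, n, p = len(x), len(W), len(W[0])
    return [[sum(x[i][k] * W[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]

def backward_sparse(x, grad_y, selected_rows):
    """Partial weight gradient: dL/dW[k] = x[:, k]^T @ grad_y,
    evaluated only for k in selected_rows."""
    m, p = len(grad_y), len(grad_y[0])
    grad_W = {}
    for k in selected_rows:  # only the selected sub-matrix indices
        grad_W[k] = [sum(x[i][k] * grad_y[i][j] for i in range(m))
                     for j in range(p)]
    return grad_W
```

In the real layer, the forward pass would also save only the selected slice of the activation x for the backward pass, which is where the further activation-memory saving comes from.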
Open Source Code | Yes | Our implementation is open source: https://github.com/HectorHHZ/Sparse_Matrix_Tuning/
Open Datasets | Yes | In Subsection 4.3, we perform fine-tuning on the Math10K (Hu et al., 2023) dataset, which includes the MultiArith, GSM8K (Cobbe et al., 2021), AddSub, AQuA, SingleEq, and SVAMP datasets, and evaluate the efficiency on their test sets.
Dataset Splits | Yes | In Subsections 4.1 and 4.2, we perform fine-tuning on the Commonsense Reasoning tasks with 8 sub-tasks, each with a predefined training and testing set. We follow the setting of (Hu et al., 2023; Liu et al., 2024a) and amalgamate the training datasets from all 8 tasks to create the final training dataset Commonsense170K, and conduct evaluations on the individual testing dataset for each task.
Hardware Specification | Yes | We conduct our experiments and implement the SOTA baselines LoRA (Microsoft) and DoRA (Shih-yang) to fine-tune the LLaMA-7B and LLaMA2-7B models with 4 NVIDIA A100 40GB GPUs, and fine-tune the LLaMA-13B and LLaMA3-8B models with 4 NVIDIA A100 80GB GPUs. Communication between the CPU and GPU is facilitated via PCIe Gen4, and communication between GPUs is facilitated via NVLink 3.
Software Dependencies | No | We used the DeepSpeed (Aminabadi et al., 2022) library for fine-tuning and the Accelerate (Gugger et al., 2022) library for inference evaluation. Both training and fine-tuning use dtype bf16. The paper mentions the DeepSpeed and Accelerate libraries but does not provide specific version numbers for them.
Experiment Setup | Yes | Both training and fine-tuning use dtype bf16. All experiments are fine-tuned for 3 epochs. In all our experiments in Section 4, sub-matrices are selected in blocks of size l = 256. ... We apply 100 warm-up iterations to all SMT experiments on the Commonsense dataset and 25 warm-up iterations to all SMT experiments on the Math10K dataset. The number of warm-up iterations is tuned for each dataset. ... Table 1: The experiments involved Full Fine-Tuning, SMT, LoRA, DoRA, and SpIEL on 4 A100 40GB GPUs using data parallelism, with a batch size of 16.
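The setup above mentions selecting sub-matrices in l x l blocks after a warm-up phase. A small sketch of what such block selection could look like, under the assumption (not stated in the quoted text) that blocks are ranked by the magnitude of gradients accumulated during warm-up; the function name and criterion are illustrative, not the paper's code:

```python
# Hypothetical block selection after warm-up: partition a weight matrix
# into l x l blocks and keep the k blocks with the largest summed
# absolute accumulated gradient. The quoted text only fixes the block
# size (l = 256); the ranking criterion here is an assumption.

def select_blocks(grad_accum, l, k):
    """grad_accum: 2-D list of accumulated gradients for one weight matrix.
    Returns (row_block, col_block) indices of the top-k blocks."""
    rows, cols = len(grad_accum), len(grad_accum[0])
    scores = []
    for bi in range(0, rows, l):
        for bj in range(0, cols, l):
            # Sum |gradient| over this l x l block (clipped at the edges).
            s = sum(abs(grad_accum[i][j])
                    for i in range(bi, min(bi + l, rows))
                    for j in range(bj, min(bj + l, cols)))
            scores.append((s, bi // l, bj // l))
    scores.sort(reverse=True)
    return [(bi, bj) for _, bi, bj in scores[:k]]
```

Only the selected blocks would then be made trainable (e.g., via the sparse linear layer described earlier), while all other weights stay frozen.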