Streamlining Language Models via Semantic Basis Analysis

Authors: Yang Li, Daniel Agyei Asante, Changsheng Zhao, Ernie Chang, Yangyang Shi, Vikas Chandra

TMLR 2025

Reproducibility assessment. Each entry lists the reproducibility variable, the result, and the supporting LLM response.
Research Type: Experimental
LLM Response: We evaluate Basel across multiple settings. First, for mathematical reasoning and code generation, we compress Llama 2-7B and Llama 2-13B with Basel and measure pass@1 accuracy on GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021), as well as on HumanEval (Chen et al., 2021a) and MBPP (Austin et al., 2021). Experimental results demonstrate that Basel achieves significant model size reduction compared to baseline techniques, while maintaining comparable or even superior accuracy across diverse applications.
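The pass@1 metric cited in this row is conventionally computed with the unbiased pass@k estimator introduced alongside HumanEval (Chen et al., 2021a): 1 - C(n-c, k)/C(n, k) for n sampled completions of which c pass. A minimal sketch (function name mine, not from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021a).

    n: total completions sampled per problem
    c: number of those completions that pass the unit tests
    k: budget of completions the metric assumes
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to the fraction of passing samples, e.g. `pass_at_k(10, 5, 1)` gives 0.5.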
Researcher Affiliation: Collaboration
LLM Response: Yang Li* (Iowa State University and Meta), Daniel Agyei Asante* (Iowa State University), Changsheng Zhao (Meta), Ernie Chang (Meta), Yangyang Shi (Meta), Vikas Chandra (Meta)
Pseudocode: Yes
LLM Response: Algorithm 1 (Basel). Input: pretrained or fine-tuned model M. Output: compressed model M. Hyperparameters: Keep Ratio, Pruning Times, Keeping Epoch, Pruning Epoch, Post Fine-Tuning Epoch, r.
Open Source Code: Yes
LLM Response: A preprint of this work is available on arXiv (Li et al., 2024a), and the source code of the work is publicly available at https://github.com/Iowa-State-University-AI-System-Group/Basel.
Open Datasets: Yes
LLM Response: For the mathematical reasoning task, we utilize two evaluation datasets: GSM8K (Cobbe et al., 2021) and Hendrycks MATH (Hendrycks et al., 2021). For the code generation task, we use two evaluation datasets: MBPP (Austin et al., 2021) and HumanEval (Chen et al., 2021a). For the language modeling task, we evaluate on WikiText-2 (Merity et al., 2016).
Dataset Splits: No
LLM Response: The paper states it uses "the training set of the target application" for retraining singular values and names the evaluation datasets (e.g., GSM8K, MATH), but it does not provide the training/test/validation split ratios or sample counts needed to reproduce the data partitioning.
Hardware Specification: Yes
LLM Response: Table 2 reports GPU hours and GPU memory consumption of Basel versus full fine-tuning on Llama 2-7B using NVIDIA L40S GPUs (batch size = 32, max sequence length = 512). Figure 12 presents the inference throughput and memory consumption of models compressed from Llama 2-7B on a single A100 GPU, using GSM8K as the evaluation set.
Software Dependencies: No
LLM Response: The paper describes the proposed method, Basel, and compares it with other compression algorithms and models (e.g., SVD, FWSVD, QLoRA, FLAP, Wanda, Llama 2-7B), but it does not list specific software dependencies or version numbers (e.g., Python, PyTorch, or CUDA versions) used for implementation or experimentation.
Experiment Setup: Yes
LLM Response: Basel is configured with the following key hyperparameters: Keep Ratio varied from 70% down to 5%, Pruning Times = 100, Keeping Epoch = 1, Pruning Epoch = 2 for math reasoning and code generation (1 for language modeling), Post Fine-Tuning Epoch = 3, and r = 32 (see Algorithm 1 for further details). Training ran on NVIDIA L40S GPUs with batch size = 32 and max sequence length = 512 (Table 2).
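For convenience, the reported settings can be collected into a single configuration object. The key names below are my own; only the values come from the paper's text.

```python
# Reported Basel hyperparameters, gathered in one place.
# Key names are illustrative; values are as stated in the paper.
BASEL_CONFIG = {
    "keep_ratio_range": (0.70, 0.05),  # varied from 70% down to 5%
    "pruning_times": 100,
    "keeping_epoch": 1,
    "pruning_epoch": {"math_and_code": 2, "language_modeling": 1},
    "post_fine_tuning_epoch": 3,
    "r": 32,
    "batch_size": 32,
    "max_sequence_length": 512,
}
```

Having the task-dependent Pruning Epoch stored as a nested mapping makes the math/code vs. language-modeling distinction explicit rather than a footnote.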