Streamlining Language Models via Semantic Basis Analysis

Authors: Yang Li, Daniel Agyei Asante, Changsheng Zhao, Ernie Chang, Yangyang Shi, Vikas Chandra

TMLR 2025

Reproducibility assessment. Each entry lists the reproducibility variable, the result, and the supporting LLM response.
Research Type: Experimental
LLM Response: We evaluate Basel across multiple settings. First, for mathematical reasoning and code generation, we compress Llama 2-7B and Llama 2-13B with Basel and measure pass@1 accuracy on GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021), as well as on HumanEval (Chen et al., 2021a) and MBPP (Austin et al., 2021). Experimental results demonstrate that Basel achieves significant model size reduction compared to baseline techniques, while maintaining comparable or even superior accuracy across diverse applications.
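The pass@1 metric cited in this row is conventionally computed with the unbiased pass@k estimator introduced alongside HumanEval (Chen et al., 2021a): 1 - C(n-c, k)/C(n, k) for n sampled completions of which c pass. A minimal sketch (function name mine, not from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021a).

    n: total completions sampled per problem
    c: number of those completions that pass the unit tests
    k: budget of completions the metric assumes
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to the fraction of passing samples, e.g. `pass_at_k(10, 5, 1)` gives 0.5.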
Researcher Affiliation: Collaboration
LLM Response: Yang Li* (Iowa State University and Meta), Daniel Agyei Asante* (Iowa State University), Changsheng Zhao (Meta), Ernie Chang (Meta), Yangyang Shi (Meta), Vikas Chandra (Meta)
Pseudocode: Yes
LLM Response: Algorithm 1 (Basel). Input: pretrained or fine-tuned model M. Output: compressed model M. Hyperparameters: Keep Ratio, Pruning Times, Keeping Epoch, Pruning Epoch, Post Fine-Tuning Epoch, r.
Open Source Code: Yes
LLM Response: A preprint of this work is available on arXiv (Li et al., 2024a), and the source code of the work is publicly available at https://github.com/Iowa-State-University-AI-System-Group/Basel.
Open Datasets: Yes
LLM Response: For the mathematical reasoning task, we utilize two evaluation datasets: GSM8K (Cobbe et al., 2021) and Hendrycks MATH (Hendrycks et al., 2021). For the code generation task, we use two evaluation datasets: MBPP (Austin et al., 2021) and HumanEval (Chen et al., 2021a). For the language modeling task, we evaluate on WikiText-2 (Merity et al., 2016).
Dataset Splits: No
LLM Response: The paper states it uses "the training set of the target application" for retraining singular values and names the evaluation datasets (e.g., GSM8K, MATH), but it does not provide the training/test/validation split ratios or sample counts needed to reproduce the data partitioning.
Hardware Specification: Yes
LLM Response: Table 2 reports GPU hours and GPU memory consumption of Basel versus full fine-tuning on Llama 2-7B using NVIDIA L40S GPUs (batch size = 32, max sequence length = 512). Figure 12 presents the inference throughput and memory consumption of models compressed from Llama 2-7B on a single A100 GPU, using GSM8K as the evaluation set.
Software Dependencies: No
LLM Response: The paper describes the proposed method, Basel, and compares it with other compression algorithms and models (e.g., SVD, FWSVD, QLoRA, FLAP, Wanda, Llama 2-7B), but it does not list specific software dependencies or version numbers (e.g., Python, PyTorch, or CUDA versions) used for implementation or experimentation.
Experiment Setup: Yes
LLM Response: Basel is configured with the following key hyperparameters: Keep Ratio varied from 70% down to 5%, Pruning Times = 100, Keeping Epoch = 1, Pruning Epoch = 2 for math reasoning and code generation (1 for language modeling), Post Fine-Tuning Epoch = 3, and r = 32 (see Algorithm 1 for further details). Training ran on NVIDIA L40S GPUs with batch size = 32 and max sequence length = 512 (Table 2).
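For convenience, the reported settings can be collected into a single configuration object. The key names below are my own; only the values come from the paper's text.

```python
# Reported Basel hyperparameters, gathered in one place.
# Key names are illustrative; values are as stated in the paper.
BASEL_CONFIG = {
    "keep_ratio_range": (0.70, 0.05),  # varied from 70% down to 5%
    "pruning_times": 100,
    "keeping_epoch": 1,
    "pruning_epoch": {"math_and_code": 2, "language_modeling": 1},
    "post_fine_tuning_epoch": 3,
    "r": 32,
    "batch_size": 32,
    "max_sequence_length": 512,
}
```

Having the task-dependent Pruning Epoch stored as a nested mapping makes the math/code vs. language-modeling distinction explicit rather than a footnote.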