CLoQ: Enhancing Fine-Tuning of Quantized LLMs via Calibrated LoRA Initialization

Authors: Yanxia Deng, Aozhong Zhang, Selcuk Gurses, Naigang Wang, Zi Yang, Penghang Yin

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we introduce CLoQ (Calibrated LoRA initialization for Quantized LLMs), a simplistic initialization strategy designed to overcome these challenges. Our approach focuses on minimizing the layer-wise discrepancy between the original LLM and its quantized counterpart with LoRA components during initialization. By leveraging a small calibration dataset, CLoQ quantizes a pre-trained LLM and determines the optimal LoRA components for each layer, ensuring a strong foundation for subsequent fine-tuning. ... We validate the efficacy of CLoQ across multiple tasks such as language generation, arithmetic reasoning, and commonsense reasoning, demonstrating that it consistently outperforms existing LoRA fine-tuning methods for quantized LLMs, especially at 2-bit.
Researcher Affiliation | Collaboration | Yanxia Deng, Department of Mathematics and Statistics, University at Albany, SUNY; Aozhong Zhang, Department of Mathematics and Statistics, University at Albany, SUNY; Selcuk Gurses, Department of Mathematics and Statistics, University at Albany, SUNY; Naigang Wang, IBM T. J. Watson Research Center; Zi Yang, Department of Mathematics and Statistics, University at Albany, SUNY; Penghang Yin, Department of Mathematics and Statistics, University at Albany, SUNY
Pseudocode | Yes | Algorithm 1: CLoQ for initializing one linear layer
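The algorithm row above refers to CLoQ's layer-wise objective: given calibration activations X, choose rank-r LoRA factors A, B that minimize ||(W − Q)X − (AB)X||_F, where W is the original weight and Q its quantized counterpart. A minimal NumPy sketch of that closed-form solution follows; this is not the authors' released code, and the function name is illustrative:

```python
import numpy as np

def cloq_style_init(W, Q, X, r):
    """Closed-form rank-r LoRA init minimizing ||(W - Q) X - (A @ B) X||_F.

    W, Q : (d_out, d_in) full-precision / quantized weights
    X    : (d_in, n)     calibration activations for this layer
    Returns A of shape (d_out, r) and B of shape (r, d_in).
    """
    Delta = W - Q
    # Thin SVD of the calibration activations: X = U diag(S) Vt.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    k = int((S > S[0] * 1e-10).sum())  # numerical rank of X
    U, S = U[:, :k], S[:k]
    # Because ||Delta X - L X||_F = ||M - K||_F with M = Delta U diag(S)
    # and K = L U diag(S), the optimal K is the top-r truncated SVD of M.
    M = Delta @ U * S
    Um, Sm, Vmt = np.linalg.svd(M, full_matrices=False)
    Um, Sm, Vmt = Um[:, :r], Sm[:r], Vmt[:r]
    A = Um * np.sqrt(Sm)
    # Map K back to L = K diag(1/S) U^T, split across the two factors.
    B = (np.sqrt(Sm)[:, None] * Vmt / S) @ U.T
    return A, B
```

By optimality of the truncated SVD, the calibrated factors are never worse, in the activation-weighted norm, than a data-free rank-r approximation of the residual W − Q.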
Open Source Code | Yes | The code is available at https://github.com/AozhongZhang/CLoQ
Open Datasets | Yes | We test CLoQ on Llama2-7b, Llama2-13b (Touvron et al., 2023), Llama3-8b (Grattafiori et al., 2024), and Mistral-7b-v0.1 (Jiang et al., 2023) models. Following prior works (Frantar et al., 2022a), we randomly sample 128 instances, each with a context length of 2048 tokens, from the WikiText-2 dataset (Merity et al., 2016) to serve as the calibration set for quantization. Then, we fine-tune and evaluate the models on WikiText-2 for language modeling. For single arithmetic reasoning tasks, we fine-tune and evaluate on GSM8K (Cobbe et al., 2021). For multiple arithmetic reasoning tasks, we fine-tune the models on Math10K (Hu et al., 2023) and then evaluate on the test sets of AQuA (Ling et al., 2017), GSM8K, MAWPS (Koncel-Kedziorski et al., 2016), and SVAMP (Patel et al., 2021). For commonsense reasoning tasks, we fine-tune the models on Commonsense170K (Hu et al., 2023) and evaluate on eight representative tasks: BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC-e, ARC-c (Clark et al., 2018), and OBQA (Mihaylov et al., 2018).
Dataset Splits | Yes | We randomly sample 128 instances, each with a context length of 2048 tokens, from the WikiText-2 dataset (Merity et al., 2016) to serve as the calibration set for quantization. Then, we fine-tune and evaluate the models on WikiText-2 for language modeling. For single arithmetic reasoning tasks, we fine-tune and evaluate on GSM8K (Cobbe et al., 2021). For multiple arithmetic reasoning tasks, we fine-tune the models on Math10K (Hu et al., 2023) and then evaluate on the test sets of AQuA (Ling et al., 2017), GSM8K, MAWPS (Koncel-Kedziorski et al., 2016), and SVAMP (Patel et al., 2021). For commonsense reasoning tasks, we fine-tune the models on Commonsense170K (Hu et al., 2023) and evaluate on eight representative tasks... Appendix A.1, Language modeling: To study the capability of CLoQ, we fine-tune quantized models on the WikiText-2 training set and measure perplexity on the validation set. Appendix A.2, Arithmetic reasoning: To assess CLoQ's arithmetic reasoning capability, we fine-tune quantized models using the GSM8K training set and evaluate their accuracy on the test set.
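The calibration protocol quoted above (128 randomly sampled sequences of 2048 tokens each) can be sketched as follows. This is a hedged illustration, not the paper's code; the function name and the flat-token-stream input are assumptions:

```python
import numpy as np

def sample_calibration_windows(token_ids, n_samples=128, ctx_len=2048, seed=0):
    """Draw fixed-length token windows uniformly at random from a flat
    token stream, mirroring GPTQ-style calibration-set construction."""
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, len(token_ids) - ctx_len + 1, size=n_samples)
    return np.stack([np.asarray(token_ids[s:s + ctx_len]) for s in starts])
```

The resulting (n_samples, ctx_len) batch is what a quantizer such as GPTQ consumes to collect per-layer activation statistics.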
Hardware Specification | Yes | All experiments are conducted on NVIDIA A100 GPUs with 80GB of memory.
Software Dependencies | No | The paper mentions using AdamW (Loshchilov, 2017) as an optimizer but does not specify versions for any other key software components, libraries, or programming languages.
Experiment Setup | Yes | The detailed hyperparameter settings for all our experiments are presented in Appendix A. ... Table 11: Hyperparameters for the fine-tuning of Llama2. Table 12: Best learning rate for Llama2-7B and Llama2-13B on WikiText-2, GSM8K, and multiple arithmetic reasoning tasks.