RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models
Authors: Quan Wei, Chung-Yiu Yau, Hoi To Wai, Yang Zhao, Dongyeop Kang, Youngsuk Park, Mingyi Hong
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on Pythia, Qwen and Llama models of different sizes demonstrate the effectiveness of RoSTE. Compared to existing post-SFT quantization baselines, our method consistently achieves superior performance across various tasks and different LLM architectures. |
| Researcher Affiliation | Collaboration | 1Department of Electrical and Computer Engineering, University of Minnesota, USA. 2Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong SAR of China. 3Department of Computer Science and Engineering, University of Minnesota, USA. 4Amazon Web Services, USA. |
| Pseudocode | Yes | Algorithm 1: RoSTE Algorithm |
| Open Source Code | Yes | Our code is available at https://github.com/OptimAI-Lab/RoSTE. |
| Open Datasets | Yes | For the first experiment (Exp.1), we fine-tune the pre-trained Pythia 1B/6.9B models (Biderman et al., 2023) and Qwen2.5 0.5B/7B models (Yang et al., 2024) on the Reddit TL;DR Summarization dataset (Huang et al., 2024) with evaluation on the TL;DR test dataset using the ROUGE metric (Lin, 2004). For the second experiment (Exp.2), we fine-tune the pre-trained Llama 3.1 8B model (Dubey et al., 2024) on the Tulu 3 SFT mixture dataset (Lambert et al., 2024) with real-world downstream task evaluations (Gao et al., 2021). These tasks include TruthfulQA (Lin et al., 2021), MMLU-Pro (Wang et al., 2024b), BIG-Bench Hard (Suzgun et al., 2022), AGIEval (Zhong et al., 2023), GSM8K (Cobbe et al., 2021), and MATH (Hendrycks et al., 2020). |
| Dataset Splits | Yes | For the first experiment (Exp.1), we fine-tune the pre-trained Pythia 1B/6.9B models (Biderman et al., 2023) and Qwen2.5 0.5B/7B models (Yang et al., 2024) on the Reddit TL;DR Summarization dataset (Huang et al., 2024) with evaluation on the TL;DR test dataset using the ROUGE metric (Lin, 2004). For the second experiment (Exp.2), we fine-tune the pre-trained Llama 3.1 8B model (Dubey et al., 2024) on the Tulu 3 SFT mixture dataset (Lambert et al., 2024) with real-world downstream task evaluations (Gao et al., 2021). |
| Hardware Specification | Yes | All experiments are conducted on a cluster of 8 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions 'CUDA kernels' but does not specify version numbers for any software dependencies, programming languages, or libraries used in the experiments. |
| Experiment Setup | Yes | Table 4. Detailed training settings for SFT in the TL;DR summarization and Tulu 3 experiments. Table 5. Detailed training settings and hyper-parameters for QA-SFT in the TL;DR summarization and Tulu 3 experiments. |
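For context on what a quantization-aware fine-tuning method like RoSTE operates on, the sketch below shows generic symmetric fake quantization (quantize-then-dequantize of weights), the building block that QAT-style methods train through. This is an illustrative assumption, not the paper's method: RoSTE's actual procedure, including the rotation and the straight-through estimator for gradients, is defined in the paper's Algorithm 1 and released code; the function name `fake_quantize` and the 4-bit setting here are hypothetical choices for the example.

```python
import numpy as np

def fake_quantize(w, num_bits=4):
    """Symmetric per-tensor fake quantization: round weights to a
    low-bit grid, then map back to floating point.

    Illustrative only -- not RoSTE's quantizer.
    """
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 7 for signed 4-bit
    scale = np.max(np.abs(w)) / qmax        # per-tensor scale factor
    if scale == 0:
        return w.copy()                     # all-zero tensor: nothing to quantize
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # integer codes
    return q * scale                        # dequantized weights

w = np.array([0.31, -0.92, 0.05, 0.77])
print(fake_quantize(w, num_bits=4))
```

During quantization-aware training, the rounding step above is non-differentiable, which is where straight-through-estimator approaches (the "STE" in RoSTE) pass gradients through as if the rounding were the identity.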