RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models
Authors: Quan Wei, Chung-Yiu Yau, Hoi To Wai, Yang Zhao, Dongyeop Kang, Youngsuk Park, Mingyi Hong
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on Pythia, Qwen and Llama models of different sizes demonstrate the effectiveness of RoSTE. Compared to existing post-SFT quantization baselines, our method consistently achieves superior performance across various tasks and different LLM architectures. |
| Researcher Affiliation | Collaboration | 1Department of Electrical and Computer Engineering, University of Minnesota, USA. 2Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong SAR of China. 3Department of Computer Science and Engineering, University of Minnesota, USA. 4Amazon Web Services, USA. |
| Pseudocode | Yes | Algorithm 1: RoSTE Algorithm |
| Open Source Code | Yes | Our code is available at https://github.com/OptimAI-Lab/RoSTE. |
| Open Datasets | Yes | For the first experiment (Exp.1), we fine-tune the pre-trained Pythia 1B/6.9B models (Biderman et al., 2023) and Qwen2.5 0.5B/7B models (Yang et al., 2024) on the Reddit TL;DR Summarization dataset (Huang et al., 2024) with evaluation on the TL;DR test dataset using the ROUGE metric (Lin, 2004). For the second experiment (Exp.2), we fine-tune the pre-trained Llama 3.1 8B model (Dubey et al., 2024) on the Tulu 3 SFT mixture dataset (Lambert et al., 2024) with real-world downstream task evaluations (Gao et al., 2021). These tasks include TruthfulQA (Lin et al., 2021), MMLU-Pro (Wang et al., 2024b), BIG-Bench Hard (Suzgun et al., 2022), AGIEval (Zhong et al., 2023), GSM8K (Cobbe et al., 2021), and MATH (Hendrycks et al., 2020). |
| Dataset Splits | Yes | For the first experiment (Exp.1), we fine-tune the pre-trained Pythia 1B/6.9B models (Biderman et al., 2023) and Qwen2.5 0.5B/7B models (Yang et al., 2024) on the Reddit TL;DR Summarization dataset (Huang et al., 2024) with evaluation on the TL;DR test dataset using the ROUGE metric (Lin, 2004). For the second experiment (Exp.2), we fine-tune the pre-trained Llama 3.1 8B model (Dubey et al., 2024) on the Tulu 3 SFT mixture dataset (Lambert et al., 2024) with real-world downstream task evaluations (Gao et al., 2021). |
| Hardware Specification | Yes | All experiments are conducted on a cluster of 8 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions 'CUDA kernels' but does not specify version numbers for any software dependencies, programming languages, or libraries used in the experiments. |
| Experiment Setup | Yes | Table 4. Detailed training settings for SFT in the TL;DR summarization and Tulu 3 experiments. Table 5. Detailed training settings and hyper-parameters for QA-SFT in the TL;DR summarization and Tulu 3 experiments. |
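For context on what a quantization-aware fine-tuning method like RoSTE operates on, the sketch below shows generic symmetric fake quantization (quantize-then-dequantize of weights), the building block that QAT-style methods train through. This is an illustrative assumption, not the paper's method: RoSTE's actual procedure, including the rotation and the straight-through estimator for gradients, is defined in the paper's Algorithm 1 and released code; the function name `fake_quantize` and the 4-bit setting here are hypothetical choices for the example.

```python
import numpy as np

def fake_quantize(w, num_bits=4):
    """Symmetric per-tensor fake quantization: round weights to a
    low-bit grid, then map back to floating point.

    Illustrative only -- not RoSTE's quantizer.
    """
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 7 for signed 4-bit
    scale = np.max(np.abs(w)) / qmax        # per-tensor scale factor
    if scale == 0:
        return w.copy()                     # all-zero tensor: nothing to quantize
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # integer codes
    return q * scale                        # dequantized weights

w = np.array([0.31, -0.92, 0.05, 0.77])
print(fake_quantize(w, num_bits=4))
```

During quantization-aware training, the rounding step above is non-differentiable, which is where straight-through-estimator approaches (the "STE" in RoSTE) pass gradients through as if the rounding were the identity.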