Bridging the Training-Inference Gap in LLMs by Leveraging Self-Generated Tokens

Authors: Zhepeng Cen, Yao Liu, Siliang Zeng, Pratik Chaudhari, Huzefa Rangwala, George Karypis, Rasool Fakoor

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By incorporating our proposed strategies during training, we have observed an overall improvement in performance compared to baseline methods, as demonstrated by our extensive experiments using summarization, general question-answering, and math question-answering tasks. In this section, we present a comprehensive empirical comparison of our proposed methods across a range of standard benchmark tasks, including summarization and question answering (QA).
Researcher Affiliation | Collaboration | Zhepeng Cen (Carnegie Mellon University); Yao Liu (Amazon Web Services); Siliang Zeng (University of Minnesota Twin Cities); Pratik Chaudhari (Amazon Web Services); Huzefa Rangwala (Amazon Web Services); George Karypis (Amazon Web Services); Rasool Fakoor (Amazon Web Services)
Pseudocode | Yes | Algorithm 1: Batch-scheduled Sampling (BASH). Input: pre-trained model ω, training dataset D. Algorithm 2: Reference-Answer-based Correction (RAC). Input: pre-trained model ω, training dataset D.
Open Source Code | No | We implement baselines and our methods based on two codebases: summarize-from-feedback [11] (for the summarization task) and Alignment-Handbook [12] (for the general QA and math QA tasks), which use DeepSpeed ZeRO (Rajbhandari et al., 2020) for higher training efficiency and less computation overhead. In the summarization task, we generate the response from the fine-tuned model with temperature=0.01 for win-rate evaluation, following the codebase used. [11] We implement algorithms based on OpenAI summarize-from-feedback and its clean-up version. [12] See link of alignment-handbook. [13] See link of SPIN. The paper does not provide its own specific code release.
Open Datasets | Yes | We use the OpenAI TL;DR dataset (Stiennon et al., 2020) for this task... For this task, we use the Ultrachat-200K dataset, a high-quality 200K subset of the Ultrachat corpus (Ding et al., 2023)... we use two commonly used math QA datasets: GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021)... we use the preprocessed UltraFeedback dataset (Cui et al., 2024) as the preference data.
Dataset Splits | Yes | Following SPIN (Ouyang et al., 2022), we randomly sample 50K prompts from the full training set to generate offline datasets Ds and Dr for our methods and then use these datasets to train our BASH and RAC as explained in Sec. 3. We use the OpenAI TL;DR dataset (Stiennon et al., 2020) for this task... We evaluate performance by calculating the win rate against the reference summary and reporting ROUGE F1 scores (Lin, 2004) on its test set. Math QA task. ... and evaluate the accuracy on their respective test sets. Specifically, we randomly choose a subset of training data with proportion 10%, 20%, 50%, 100%.
Hardware Specification | Yes | Computation overhead comparison between SCS and BASH with the pythia-1B model on the summarization task. We compare the performance on an 8x A6000 (48G) machine. Computation overhead comparison between SCS and BASH with the Mistral-7B model on the general QA task. We compare the performance on an 8x A6000 (48G) machine.
Software Dependencies | No | We implement baselines and our methods based on two codebases: summarize-from-feedback [11] (for the summarization task) and Alignment-Handbook [12] (for the general QA and math QA tasks), which use DeepSpeed ZeRO (Rajbhandari et al., 2020) for higher training efficiency and less computation overhead. This mentions DeepSpeed ZeRO but without a version number, and no other software versions are specified.
Experiment Setup | Yes | Table 5: Hyper-parameters used for experiments.

Hyper-parameter | Summarization | General QA | Math QA
base pretrained model | pythia-1B | Mistral-7B-v0.1 | Mistral-7B-v0.1
precision | bfloat16 | bfloat16 | bfloat16
optimizer | AdamW | AdamW | AdamW
learning rate | 3e-6 | 5e-6 | 5e-6
learning rate warmup steps | no warmup | 10% | 10%
learning rate scheduler | cosine | cosine | cosine
global batch size | 512 | 512 | 512
SFT training epochs | 2 | / | 1
training iterations (BASH & RAC) | 1 | 2 | 1
training epochs per iteration (BASH & RAC) | 1 | [1, 2] | 1
mixture coefficient β in BASH generation | 0.2 | 0.2 | 0.2
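To make the scheduled-sampling idea behind the Algorithm 1 pseudocode (BASH) concrete, the sketch below mixes reference tokens with self-generated tokens under a mixture coefficient β (the table above reports β = 0.2). This is a minimal illustrative sketch, not the paper's exact procedure: the helper `sample_model_token` and the reading of β as the per-token probability of using a self-generated token are assumptions.

```python
import random

def mix_tokens(reference_tokens, sample_model_token, beta=0.2, seed=0):
    """Build a training sequence by stochastically mixing ground-truth
    (reference) tokens with self-generated tokens, in the spirit of
    scheduled sampling.

    reference_tokens: the ground-truth target sequence.
    sample_model_token: callable(prefix) -> token; stands in for drawing
        the next token from the model given the current prefix
        (hypothetical helper, not from the paper).
    beta: assumed probability of taking a self-generated token instead
        of the reference token at each position.
    """
    rng = random.Random(seed)
    prefix = []
    for ref_tok in reference_tokens:
        if rng.random() < beta:
            prefix.append(sample_model_token(prefix))  # self-generated token
        else:
            prefix.append(ref_tok)                     # ground-truth token
    return prefix
```

With β = 0 this reduces to ordinary teacher forcing (the pure reference sequence), and with β = 1 the whole sequence is self-generated, which is why a small β such as 0.2 interpolates between training-time and inference-time token distributions.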