Bridging the Training-Inference Gap in LLMs by Leveraging Self-Generated Tokens

Authors: Zhepeng Cen, Yao Liu, Siliang Zeng, Pratik Chaudhari, Huzefa Rangwala, George Karypis, Rasool Fakoor

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By incorporating our proposed strategies during training, we have observed an overall improvement in performance compared to baseline methods, as demonstrated by our extensive experiments using summarization, general question-answering, and math question-answering tasks. In this section, we present a comprehensive empirical comparison of our proposed methods across a range of standard benchmark tasks, including summarization and question answering (QA).
Researcher Affiliation | Collaboration | Zhepeng Cen (Carnegie Mellon University); Yao Liu (Amazon Web Services); Siliang Zeng (University of Minnesota Twin Cities); Pratik Chaudhari (Amazon Web Services); Huzefa Rangwala (Amazon Web Services); George Karypis (Amazon Web Services); Rasool Fakoor (Amazon Web Services)
Pseudocode | Yes | Algorithm 1: Batch-scheduled Sampling (BASH). Input: pre-trained model ω, training dataset D. Algorithm 2: Reference-Answer-based Correction (RAC). Input: pre-trained model ω, training dataset D.
Open Source Code | No | We implement baselines and our methods based on two codebases: summarize-from-feedback [11] (for the summarization task) and Alignment-Handbook [12] (for the general QA and math QA tasks), which use DeepSpeed ZeRO (Rajbhandari et al., 2020) for higher training efficiency and less computation overhead. In the summarization task, we generate the response from the fine-tuned model with temperature=0.01 for win-rate evaluation, following the codebase used. [11] We implement algorithms based on OpenAI summarize-from-feedback and its clean-up version. [12] See link of alignment-handbook. [13] See link of SPIN. The paper does not provide its own specific code release.
Open Datasets | Yes | We use the OpenAI TL;DR dataset (Stiennon et al., 2020) for this task... For this task, we use the Ultrachat-200K dataset, a high-quality 200K subset of the Ultrachat corpus (Ding et al., 2023)... we use two commonly used math QA datasets: GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021)... we use the preprocessed UltraFeedback dataset (Cui et al., 2024) as the preference data.
Dataset Splits | Yes | Following SPIN (Ouyang et al., 2022), we randomly sample 50K prompts from the full training set to generate offline datasets Ds and Dr for our methods and then use these datasets to train our BASH and RAC as explained in Sec. 3. We use the OpenAI TL;DR dataset (Stiennon et al., 2020) for this task... We evaluate performance by calculating the win rate against the reference summary and reporting ROUGE F1 scores (Lin, 2004) on its test set. Math QA task. ... and evaluate the accuracy on their respective test sets. Specifically, we randomly choose a subset of training data with proportion 10%, 20%, 50%, 100%.
Hardware Specification | Yes | Computation overhead comparison between SCS and BASH with the pythia-1B model on the summarization task. We compare the performance on an 8x A6000 (48G) machine. Computation overhead comparison between SCS and BASH with the Mistral-7B model on the general QA task. We compare the performance on an 8x A6000 (48G) machine.
Software Dependencies | No | We implement baselines and our methods based on two codebases: summarize-from-feedback [11] (for the summarization task) and Alignment-Handbook [12] (for the general QA and math QA tasks), which use DeepSpeed ZeRO (Rajbhandari et al., 2020) for higher training efficiency and less computation overhead. This mentions DeepSpeed ZeRO but without a version number, and no other software versions are specified.
Experiment Setup | Yes | Table 5: Hyper-parameters used for experiments.

Hyper-parameter | Summarization | General QA | Math QA
base pretrained model | pythia-1B | Mistral-7B-v0.1 | Mistral-7B-v0.1
precision | bfloat16 | bfloat16 | bfloat16
optimizer | AdamW | AdamW | AdamW
learning rate | 3e-6 | 5e-6 | 5e-6
learning rate warmup steps | no warmup | 10% | 10%
learning rate scheduler | cosine | cosine | cosine
global batch size | 512 | 512 | 512
SFT training epochs | 2 | / | 1
training iterations (BASH & RAC) | 1 | 2 | 1
training epochs per iteration (BASH & RAC) | 1 | [1, 2] | 1
mixture coefficient β in BASH generation | 0.2 | 0.2 | 0.2
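To make the scheduled-sampling idea behind the Algorithm 1 pseudocode (BASH) concrete, the sketch below mixes reference tokens with self-generated tokens under a mixture coefficient β (the table above reports β = 0.2). This is a minimal illustrative sketch, not the paper's exact procedure: the helper `sample_model_token` and the reading of β as the per-token probability of using a self-generated token are assumptions.

```python
import random

def mix_tokens(reference_tokens, sample_model_token, beta=0.2, seed=0):
    """Build a training sequence by stochastically mixing ground-truth
    (reference) tokens with self-generated tokens, in the spirit of
    scheduled sampling.

    reference_tokens: the ground-truth target sequence.
    sample_model_token: callable(prefix) -> token; stands in for drawing
        the next token from the model given the current prefix
        (hypothetical helper, not from the paper).
    beta: assumed probability of taking a self-generated token instead
        of the reference token at each position.
    """
    rng = random.Random(seed)
    prefix = []
    for ref_tok in reference_tokens:
        if rng.random() < beta:
            prefix.append(sample_model_token(prefix))  # self-generated token
        else:
            prefix.append(ref_tok)                     # ground-truth token
    return prefix
```

With β = 0 this reduces to ordinary teacher forcing (the pure reference sequence), and with β = 1 the whole sequence is self-generated, which is why a small β such as 0.2 interpolates between training-time and inference-time token distributions.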