Towards Cost-Effective Reward Guided Text Generation

Authors: Ahmad Rashid, Ruotian Wu, Rongqi Fan, Hongliang Li, Agustinus Kristiadi, Pascal Poupart

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Empirically, our reward model leads to significantly faster inference than other RGTG methods. It requires fewer calls to the reward model and performs competitively compared to previous RGTG and offline RLHF methods. Code for our work is available at https://github.com/ahmadrash/FaRMA. We report extensive experiments with recent LLMs on various text generation tasks, demonstrating faster inference and strong alignment performance. (Section 6, Experiments)
Researcher Affiliation Collaboration University of Waterloo, Vector Institute, Huawei Technologies.
Pseudocode Yes Algorithm 1 Our Training Algorithm. Input: Base LLM to initialize the reward model V_theta; full-sequence preference dataset D_BT = {(x_k, y_w^k, y_l^k)}_{k=1}^{K_BT}; number of alternating iterations iter_n; mini-batch size n; partial-sequence dataset D_max = {(x_k, y_k)}_{k=1}^{K_max}. Output: V_theta.
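The alternating structure of Algorithm 1 can be sketched as follows. This is an illustrative skeleton only: the update functions here are hypothetical stand-ins (simple counters), not the paper's actual Bradley-Terry or partial-sequence objectives, and all names are assumptions.

```python
def minibatches(data, n):
    """Yield successive mini-batches of size n from a dataset."""
    for i in range(0, len(data), n):
        yield data[i:i + n]

def preference_update(V, batch):
    # Stand-in for a full-sequence preference (Bradley-Terry-style) update on D_BT.
    V["bt_updates"] += len(batch)
    return V

def partial_sequence_update(V, batch):
    # Stand-in for the partial-sequence update on D_max.
    V["partial_updates"] += len(batch)
    return V

def train_reward_model(V, D_BT, D_max, iter_n, n):
    """Alternate between preference updates (D_BT) and partial-sequence
    updates (D_max) for iter_n iterations, as in Algorithm 1's outer loop."""
    for _ in range(iter_n):
        for batch in minibatches(D_BT, n):
            V = preference_update(V, batch)
        for batch in minibatches(D_max, n):
            V = partial_sequence_update(V, batch)
    return V
```

The sketch only captures the control flow (alternating passes over the two datasets); the learned reward model V_theta is replaced by a dictionary of counters for brevity.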
Open Source Code Yes Code for our work is available at https://github.com/ahmadrash/FaRMA
Open Datasets Yes We pick the Reddit TL;DR (Völske et al., 2017) as the dataset for the summarization task. ... We use the human preference dataset from Stiennon et al. (2020a) to perform all the training and decoding. ... Next we evaluate our method on a dialogue task using the Anthropic Helpful and Harmless (HH) (Bai et al., 2022) dataset... We have additional results on text generation on the UltraFeedback (UF) dataset (Cui et al., 2024) in Appendix A.
Dataset Splits No The paper mentions using well-known datasets such as Reddit TL;DR, Anthropic Helpful and Harmless (HH), and Ultra-Feedback. While these datasets are commonly used, the paper does not explicitly detail the specific training, validation, or test splits (e.g., percentages, sample counts, or explicit standard split references) used for the experiments described in this work. It only mentions evaluation on "100 samples" or "50 samples" without defining their origin relative to the full datasets.
Hardware Specification Yes All experiments are run on a server with NVIDIA A40 GPUs (40GB VRAM) and NVIDIA A100 GPUs (80GB VRAM).
Software Dependencies Yes We use CUDA Toolkit version 11.2 and the PyTorch 2.5.1 framework.
Experiment Setup Yes Training details, including hyper-parameters, are presented in Appendix B. Table 6 (training hyperparameters for the reward model): mini-batch size 8000; number of alternating steps 5; learning rate 5e-6; batch size 8; gradient accumulation steps 8; DeepSpeed ZeRO stage 2; max. sequence length 512. Tables 7 and 8 provide similar details for the DPO and PPO models.
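For reproduction purposes, the reported Table 6 values can be collected into a single configuration object. The key names below are illustrative assumptions, not taken from the released code; only the numeric values come from the paper.

```python
# Hypothetical grouping of the reported reward-model hyperparameters (Table 6).
reward_model_config = {
    "partial_minibatch_size": 8000,   # mini-batch size for the alternating scheme
    "alternating_steps": 5,           # number of alternating iterations
    "learning_rate": 5e-6,
    "batch_size": 8,                  # per-step batch size
    "grad_accum_steps": 8,            # gradient accumulation steps
    "deepspeed_zero_stage": 2,
    "max_sequence_length": 512,
}

# Effective optimizer batch per device (before multiplying by GPU count).
effective_batch = (reward_model_config["batch_size"]
                   * reward_model_config["grad_accum_steps"])
```

Noting the effective batch size (8 × 8 = 64 per device) is useful when attempting to match the reported training setup on different hardware.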