Towards Cost-Effective Reward Guided Text Generation
Authors: Ahmad Rashid, Ruotian Wu, Rongqi Fan, Hongliang Li, Agustinus Kristiadi, Pascal Poupart
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our reward model leads to significantly faster inference than other RGTG methods. It requires fewer calls to the reward model and performs competitively compared to previous RGTG and offline RLHF methods. Code for our work is available at https://github.com/ahmadrash/FaRMA. We report extensive experiments with recent LLMs on various text generation tasks, demonstrating faster inference and strong alignment performance. |
| Researcher Affiliation | Collaboration | 1: University of Waterloo, 2: Vector Institute, 3: Huawei Technologies. |
| Pseudocode | Yes | Algorithm 1 Our Training Algorithm. Input: Base LLM to initialize the reward model V_θ, full-sequence preference dataset D_BT = {(x^k, y_w^k, y_l^k)}_{k=1}^{K_BT}, number of alternating iterations iter_n, mini-batch size n, partial-sequence dataset D_max = {(x^k, y^k)}_{k=1}^{K_max}. Output: V_θ |
| Open Source Code | Yes | Code for our work is available at https://github.com/ahmadrash/FaRMA |
| Open Datasets | Yes | We pick the Reddit TL;DR (Völske et al., 2017) as the dataset for the summarization task. ... We use the human preference dataset from Stiennon et al. (2020a) to perform all the training and decoding. ... Next we evaluate our method on a dialogue task using the Anthropic Helpful and Harmless (HH) (Bai et al., 2022) dataset... We have additional results on text generation on the UltraFeedback (UF) dataset (Cui et al., 2024) in Appendix A. |
| Dataset Splits | No | The paper uses well-known datasets (Reddit TL;DR, Anthropic Helpful and Harmless (HH), and UltraFeedback), but it does not explicitly detail the training, validation, or test splits used (e.g., percentages, sample counts, or a reference to a standard split). It mentions evaluation on "100 samples" or "50 samples" without defining their origin relative to the full datasets. |
| Hardware Specification | Yes | All experiments are run on a server with NVIDIA A40 GPUs (40GB VRAM) and NVIDIA A100 GPUs (80GB VRAM). |
| Software Dependencies | Yes | We use CUDA Toolkit version 11.2 and the PyTorch 2.5.1 framework. |
| Experiment Setup | Yes | Training details, including hyper-parameters, are presented in Appendix B. Table 6 (training hyperparameters for the reward model): trained mini-batch size 8000; number of alternating steps 5; LR 5e-6; batch size 8; gradient acc. steps 8; DeepSpeed ZeRO stage 2; max. sequence length 512. Tables 7 and 8 provide similar details for the DPO and PPO models. |
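The Algorithm 1 excerpt quoted in the Pseudocode row trains a reward model V_θ from a full-sequence Bradley-Terry preference dataset D_BT of (prompt, chosen, rejected) triples. As a minimal, framework-free sketch of that preference objective (the function names and toy scores below are illustrative, not the authors' code, which handles partial-sequence data and alternating iterations as well):

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    # Bradley-Terry preference loss for one (y_w, y_l) pair:
    # -log sigmoid(V_theta(x, y_w) - V_theta(x, y_l)).
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def batch_loss(reward_pairs):
    # Mean loss over a mini-batch of (reward_chosen, reward_rejected)
    # scalar scores produced by the reward model.
    return sum(bradley_terry_loss(rw, rl) for rw, rl in reward_pairs) / len(reward_pairs)

# Toy example: the loss is small when chosen sequences already score
# higher than rejected ones, and large when the ordering is inverted.
well_ordered = batch_loss([(2.0, 0.5), (1.5, -0.2)])
inverted = batch_loss([(0.5, 2.0), (-0.2, 1.5)])
```

Minimizing this loss pushes V_θ to assign higher scores to preferred sequences; the paper's alternating loop additionally fits V_θ on the partial-sequence dataset D_max so that it can guide decoding token by token.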