Towards Cost-Effective Reward Guided Text Generation
Authors: Ahmad Rashid, Ruotian Wu, Rongqi Fan, Hongliang Li, Agustinus Kristiadi, Pascal Poupart
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our reward model leads to significantly faster inference than other RGTG methods. It requires fewer calls to the reward model and performs competitively compared to previous RGTG and offline RLHF methods. Code for our work is available at https://github.com/ahmadrash/FaRMA. We report extensive experiments with recent LLMs on various text generation tasks, demonstrating faster inference and strong alignment performance. |
| Researcher Affiliation | Collaboration | 1: University of Waterloo, 2: Vector Institute, 3: Huawei Technologies. |
| Pseudocode | Yes | Algorithm 1 Our Training Algorithm. Input: Base LLM to initialize the reward model V_θ, full-sequence preference dataset D_BT = {(x^k, y_w^k, y_l^k)}_{k=1}^{K_BT}, number of alternating iterations iter_n, mini-batch size n, partial-sequence dataset D_max = {(x^k, y^k)}_{k=1}^{K_max}. Output: V_θ |
| Open Source Code | Yes | Code for our work is available at https://github.com/ahmadrash/FaRMA |
| Open Datasets | Yes | We pick the Reddit TL;DR (Völske et al., 2017) as the dataset for the summarization task. ... We use the human preference dataset from Stiennon et al. (2020a) to perform all the training and decoding. ... Next we evaluate our method on a dialogue task using the Anthropic Helpful and Harmless (HH) (Bai et al., 2022) dataset... We have additional results on text generation on the UltraFeedback (UF) dataset (Cui et al., 2024) in Appendix A. |
| Dataset Splits | No | The paper uses well-known datasets (Reddit TL;DR, Anthropic Helpful and Harmless (HH), and UltraFeedback), but it does not explicitly detail the training, validation, or test splits used (e.g., percentages, sample counts, or a reference to a standard split). It mentions evaluation on "100 samples" or "50 samples" without defining their origin relative to the full datasets. |
| Hardware Specification | Yes | All experiments are run on a server with NVIDIA A40 GPUs (40GB VRAM) and NVIDIA A100 GPUs (80GB VRAM). |
| Software Dependencies | Yes | We use CUDA Toolkit version 11.2 and the PyTorch 2.5.1 framework. |
| Experiment Setup | Yes | Training details, including hyper-parameters, are presented in Appendix B. Table 6 (training hyperparameters for the reward model): trained mini-batch size 8000; number of alternating steps 5; LR 5e-6; batch size 8; gradient acc. steps 8; DeepSpeed ZeRO stage 2; max. sequence length 512. Tables 7 and 8 provide similar details for the DPO and PPO models. |
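The Algorithm 1 excerpt quoted in the Pseudocode row trains a reward model V_θ from a full-sequence Bradley-Terry preference dataset D_BT of (prompt, chosen, rejected) triples. As a minimal, framework-free sketch of that preference objective (the function names and toy scores below are illustrative, not the authors' code, which handles partial-sequence data and alternating iterations as well):

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    # Bradley-Terry preference loss for one (y_w, y_l) pair:
    # -log sigmoid(V_theta(x, y_w) - V_theta(x, y_l)).
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def batch_loss(reward_pairs):
    # Mean loss over a mini-batch of (reward_chosen, reward_rejected)
    # scalar scores produced by the reward model.
    return sum(bradley_terry_loss(rw, rl) for rw, rl in reward_pairs) / len(reward_pairs)

# Toy example: the loss is small when chosen sequences already score
# higher than rejected ones, and large when the ordering is inverted.
well_ordered = batch_loss([(2.0, 0.5), (1.5, -0.2)])
inverted = batch_loss([(0.5, 2.0), (-0.2, 1.5)])
```

Minimizing this loss pushes V_θ to assign higher scores to preferred sequences; the paper's alternating loop additionally fits V_θ on the partial-sequence dataset D_max so that it can guide decoding token by token.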