On the Low-Rank Parametrization of Reward Models for Controlled Language Generation
Authors: Sergey Troshin, Vlad Niculae, Antske Fokkens
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that the low-rank RAD performs on par with the more flexible RAD on a detoxification and a sentiment control task, while requiring only a single reward model call per generated token. |
| Researcher Affiliation | Academia | Sergey Troshin EMAIL Language Technology Lab, Informatics Institute, University of Amsterdam Vlad Niculae EMAIL Language Technology Lab, Informatics Institute, University of Amsterdam Antske Fokkens EMAIL Computational Linguistics and Text Mining Lab, Faculty of Social Sciences and Humanities, Vrije Universiteit Amsterdam |
| Pseudocode | No | The paper describes methods using mathematical equations and textual descriptions but does not contain any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/serjtroshin/rad-q |
| Open Datasets | Yes | For the detoxification evaluation, we follow previous work (Deng & Raffel, 2023; Liu et al., 2021) and evaluate samples from guided decoding given a 10k subset (Liu et al., 2021) of prompts from the Real Toxicity Prompts dataset (Gehman et al., 2020). We follow Deng & Raffel (2023) and Liu et al. (2021) and finetune our model on 2M pairs of text and continuous toxicity responses between 0 and 1 from the Jigsaw Unintended Bias in Toxicity Classification challenge (cjadams et al., 2019). [...] To finetune RAD-Q on responses only, we follow Deng & Raffel (2023) and finetune our model on millions of reviews from the Amazon Polarity (Zhang et al., 2015) and SST-2 (Socher et al., 2013) datasets. |
| Dataset Splits | No | The paper mentions using a '10k subset' of prompts for detoxification evaluation and '2M pairs' for training, and similarly for sentiment control, refers to '2.5K negative, 5K neutral, and 2.5K positive prompts' for evaluation and 'millions of reviews' for training. However, it does not provide specific, reproducible details about how these datasets were split into training, validation, or test sets for the authors' experiments (e.g., specific percentages, sample counts for each split used in their models, or explicit references to standard splits for their model training/evaluation process). |
| Hardware Specification | Yes | In Figure 8, we measure the time per generated token when running the decoding for the toxicity task with RAD-Q and RAD-V (Deng & Raffel, 2023) on a single RTX A6000 GPU. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer (Kingma & Ba, 2015)' and refers to libraries like 'Numpy' and 'PyTorch' in the context of numerical rank calculation, but does not provide specific version numbers for these software components or any other key libraries used in their implementation, which is necessary for reproducibility. |
| Experiment Setup | Yes | To train reward models, we reuse the hyperparameters from Deng & Raffel (2023), where possible. We finetune the reward models with Adam optimizer (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.95, ϵ = 1e-12. We use weight decay 0.01, batch size 100, and the learning rate changes linearly from the initial value (10^-5 by default) to zero. [...] For the detoxification task, we finetune RAD-Q with the learning rate 10^-5 for 5 epochs. [...] To finetune RAD-Q on responses only for sentiment control task, we first finetune the model with the learning rate 10^-5 on the Amazon Polarity dataset, and then finetune it for 5 epochs on the SST-2 dataset with the learning rate 2e-6. For distillation experiment, we finetune RAD-Q for 5 epochs with the learning rate 10^-5 on Amazon Polarity dataset. |
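The quoted setup pins down the optimizer hyperparameters and a linear learning-rate decay but not the surrounding training code. A minimal sketch of those quoted values, assuming a plain Adam-style configuration and a standard linear-to-zero schedule (the `HPARAMS` dict and `linear_lr` helper are illustrative names, not from the paper):

```python
# Hyperparameters quoted from the paper's experiment setup; everything else
# (names, structure) is an assumption for illustration.
HPARAMS = {
    "optimizer": "Adam",       # Kingma & Ba, 2015
    "betas": (0.9, 0.95),
    "eps": 1e-12,
    "weight_decay": 0.01,
    "batch_size": 100,
    "lr_init": 1e-5,           # 2e-6 for the SST-2 finetuning stage
    "epochs": 5,
}

def linear_lr(step: int, total_steps: int, lr_init: float) -> float:
    """Learning rate decaying linearly from lr_init to zero over training."""
    return lr_init * max(0.0, 1.0 - step / total_steps)
```

Reproducing the schedule this way still requires the total step count, which depends on dataset size and the batch size of 100 quoted above.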