On the Low-Rank Parametrization of Reward Models for Controlled Language Generation
Authors: Sergey Troshin, Vlad Niculae, Antske Fokkens
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that the low-rank RAD performs on par with the more flexible RAD on a detoxification and a sentiment control task, while requiring only a single reward model call per generated token. |
| Researcher Affiliation | Academia | Sergey Troshin EMAIL Language Technology Lab, Informatics Institute, University of Amsterdam Vlad Niculae EMAIL Language Technology Lab, Informatics Institute, University of Amsterdam Antske Fokkens EMAIL Computational Linguistics and Text Mining Lab, Faculty of Social Sciences and Humanities, Vrije Universiteit Amsterdam |
| Pseudocode | No | The paper describes methods using mathematical equations and textual descriptions but does not contain any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/serjtroshin/rad-q |
| Open Datasets | Yes | For the detoxification evaluation, we follow previous work (Deng & Raffel, 2023; Liu et al., 2021) and evaluate samples from guided decoding given a 10k subset (Liu et al., 2021) of prompts from the Real Toxicity Prompts dataset (Gehman et al., 2020). We follow Deng & Raffel (2023) and Liu et al. (2021) and finetune our model on 2M pairs of text and continuous toxicity responses between 0 and 1 from the Jigsaw Unintended Bias in Toxicity Classification challenge (cjadams et al., 2019). [...] To finetune RAD-Q on responses only, we follow Deng & Raffel (2023) and finetune our model on millions of reviews from the Amazon Polarity (Zhang et al., 2015) and SST-2 (Socher et al., 2013) datasets. |
| Dataset Splits | No | The paper mentions using a '10k subset' of prompts for detoxification evaluation and '2M pairs' for training, and similarly for sentiment control, refers to '2.5K negative, 5K neutral, and 2.5K positive prompts' for evaluation and 'millions of reviews' for training. However, it does not provide specific, reproducible details about how these datasets were split into training, validation, or test sets for the authors' experiments (e.g., specific percentages, sample counts for each split used in their models, or explicit references to standard splits for their model training/evaluation process). |
| Hardware Specification | Yes | In Figure 8, we measure the time per generated token when running the decoding for the toxicity task with RAD-Q and RAD-V (Deng & Raffel, 2023) on a single RTX A6000 GPU. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer (Kingma & Ba, 2015)' and refers to libraries like 'Numpy' and 'PyTorch' in the context of numerical rank calculation, but does not provide specific version numbers for these software components or any other key libraries used in their implementation, which is necessary for reproducibility. |
| Experiment Setup | Yes | To train reward models, we reuse the hyperparameters from Deng & Raffel (2023), where possible. We finetune the reward models with Adam optimizer (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.95, ϵ = 1e-12. We use weight decay 0.01, batch size 100, and the learning rate changes linearly from the initial value (10^-5 by default) to zero. [...] For the detoxification task, we finetune RAD-Q with the learning rate 10^-5 for 5 epochs. [...] To finetune RAD-Q on responses only for sentiment control task, we first finetune the model with the learning rate 10^-5 on the Amazon Polarity dataset, and then finetune it for 5 epochs on the SST-2 dataset with the learning rate 2e-6. For distillation experiment, we finetune RAD-Q for 5 epochs with the learning rate 10^-5 on Amazon Polarity dataset. |
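The quoted setup pins down the optimizer hyperparameters and a linear learning-rate decay but not the surrounding training code. A minimal sketch of those quoted values, assuming a plain Adam-style configuration and a standard linear-to-zero schedule (the `HPARAMS` dict and `linear_lr` helper are illustrative names, not from the paper):

```python
# Hyperparameters quoted from the paper's experiment setup; everything else
# (names, structure) is an assumption for illustration.
HPARAMS = {
    "optimizer": "Adam",       # Kingma & Ba, 2015
    "betas": (0.9, 0.95),
    "eps": 1e-12,
    "weight_decay": 0.01,
    "batch_size": 100,
    "lr_init": 1e-5,           # 2e-6 for the SST-2 finetuning stage
    "epochs": 5,
}

def linear_lr(step: int, total_steps: int, lr_init: float) -> float:
    """Learning rate decaying linearly from lr_init to zero over training."""
    return lr_init * max(0.0, 1.0 - step / total_steps)
```

Reproducing the schedule this way still requires the total step count, which depends on dataset size and the batch size of 100 quoted above.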