Interpreting Language Reward Models via Contrastive Explanations

Authors: Junqi Jiang, Tom Bewley, Saumitra Mishra, Freddy Lecue, Manuela Veloso

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we demonstrate the effectiveness of our method for generating high-quality contrastive explanations. Our experiments are conducted on three open source human preference datasets and three RMs. For each dataset, we randomly select 30 binary comparisons from the training set serving as test comparisons (repeated five times with different random seeds, making 150 test comparisons in total), for which we then generate contrastive explanations using our method and the baselines for each RM. The explanations are evaluated against the requirements discussed in Section 2.3 using popular metrics from the text CF literature (Nguyen et al., 2024)."
Researcher Affiliation | Collaboration | Imperial College London; J.P. Morgan AI Research ({firstname.surname}@jpmorgan.com)
Pseudocode | No | "The paper describes methods using natural language and a visual overview in Figure 2, but does not include any explicitly labeled pseudocode or algorithm blocks."
Open Source Code | No | "The paper does not contain an explicit statement about releasing the source code for the methodology described, nor does it provide a direct link to a code repository for their implementation."
Open Datasets | Yes | "We use HelpSteer2 (HS2) (Wang et al., 2024b), HH-RLHF-helpful, and HH-RLHF-harmless (Bai et al., 2022)."
Dataset Splits | No | "For each dataset, we randomly select 30 binary comparisons from the training set serving as test comparisons (repeated five times with different random seeds, making 150 test comparisons in total), for which we then generate contrastive explanations using our method and the baselines for each RM."
Hardware Specification | No | "The paper does not provide specific details about the hardware (e.g., GPU models, CPU models, or cloud computing instances with specifications) used to conduct the experiments."
Software Dependencies | No | "The paper mentions several software components and models like GPT-4o, Sentence-BERT, and Polyjuice (Wu et al., 2021), but it does not specify explicit version numbers for these or other key software dependencies required to replicate the experiments."
Experiment Setup | Yes | "In our experiments, Y+ and Y− both contain 15 perturbed responses, each associated with one attribute from the following list: avoid-to-answer, appropriateness, assertiveness, clarity, coherence, complexity, correctness, engagement, harmlessness, helpfulness, informativeness, neutrality, relevance, sensitivity, verbosity. [...] For each dataset, we randomly select 30 binary comparisons from the training set serving as test comparisons (repeated five times with different random seeds, making 150 test comparisons in total), for which we then generate contrastive explanations using our method and the baselines for each RM."
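The sampling protocol quoted above (30 binary comparisons per seed, 5 seeds, 150 test comparisons in total, with 15 perturbation attributes per side) can be sketched as follows. This is a hedged illustration, not the authors' released code: the function names, the toy `training_set` structure, and the use of Python's `random` module are assumptions for demonstration only.

```python
import random

# The 15 perturbation attributes listed in the paper's experiment setup.
ATTRIBUTES = [
    "avoid-to-answer", "appropriateness", "assertiveness", "clarity",
    "coherence", "complexity", "correctness", "engagement", "harmlessness",
    "helpfulness", "informativeness", "neutrality", "relevance",
    "sensitivity", "verbosity",
]

NUM_COMPARISONS_PER_SEED = 30  # comparisons drawn per random seed
SEEDS = [0, 1, 2, 3, 4]        # 5 repetitions -> 150 comparisons total

def sample_test_comparisons(training_set, num=NUM_COMPARISONS_PER_SEED,
                            seeds=SEEDS):
    """Draw `num` binary comparisons per seed from the training set.

    Each element of `training_set` is a (chosen, rejected) response pair;
    a fresh seeded RNG per repetition makes each draw reproducible.
    """
    batches = []
    for seed in seeds:
        rng = random.Random(seed)
        batches.append(rng.sample(training_set, num))
    return batches

if __name__ == "__main__":
    # Toy training set standing in for one preference dataset.
    training_set = [(f"chosen_{i}", f"rejected_{i}") for i in range(1000)]
    batches = sample_test_comparisons(training_set)
    total = sum(len(batch) for batch in batches)
    print(len(ATTRIBUTES), total)  # 15 attributes, 150 test comparisons
```

In this sketch, each of the 150 sampled comparisons would then be perturbed once per attribute to build the Y+ and Y− sets of 15 responses each.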