Interpreting Language Reward Models via Contrastive Explanations
Authors: Junqi Jiang, Tom Bewley, Saumitra Mishra, Freddy Lecue, Manuela Veloso
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we demonstrate the effectiveness of our method for generating high-quality contrastive explanations. Our experiments are conducted on three open source human preference datasets and three RMs. For each dataset, we randomly select 30 binary comparisons from the training set serving as test comparisons (repeated five times with different random seeds, making 150 test comparisons in total), for which we then generate contrastive explanations using our method and the baselines for each RM. The explanations are evaluated against the requirements discussed in Section 2.3 using popular metrics from the text CF literature (Nguyen et al., 2024). |
| Researcher Affiliation | Collaboration | Imperial College London; J.P. Morgan AI Research ({firstname.surname}@jpmorgan.com) |
| Pseudocode | No | The paper describes methods using natural language and a visual overview in Figure 2, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the methodology described, nor does it provide a direct link to a code repository for their implementation. |
| Open Datasets | Yes | We use Help Steer2 (hs2) (Wang et al., 2024b), HH-RLHF-helpful, and HH-RLHF-harmless (Bai et al., 2022). |
| Dataset Splits | No | For each dataset, we randomly select 30 binary comparisons from the training set serving as test comparisons (repeated five times with different random seeds, making 150 test comparisons in total), for which we then generate contrastive explanations using our method and the baselines for each RM. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU models, or cloud computing instances with specifications) used to conduct the experiments. |
| Software Dependencies | No | The paper mentions several software components and models like GPT-4o, Sentence-BERT, and Polyjuice (Wu et al., 2021), but it does not specify explicit version numbers for these or other key software dependencies required to replicate the experiments. |
| Experiment Setup | Yes | In our experiments, Y+ and Y both contain 15 perturbed responses, each associated with one attribute from the following list: avoid-to-answer, appropriateness, assertiveness, clarity, coherence, complexity, correctness, engagement, harmlessness, helpfulness, informativeness, neutrality, relevance, sensitivity, verbosity. [...] For each dataset, we randomly select 30 binary comparisons from the training set serving as test comparisons (repeated five times with different random seeds, making 150 test comparisons in total), for which we then generate contrastive explanations using our method and the baselines for each RM. |
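The test-comparison sampling protocol quoted above (30 binary comparisons per seed, five random seeds, 150 in total) can be sketched as follows. This is a minimal illustration, not code from the paper: the function name, the seed values, and the structure of `train_comparisons` as (chosen, rejected) pairs are all assumptions.

```python
import random

def sample_test_comparisons(train_comparisons, n_per_seed=30, seeds=(0, 1, 2, 3, 4)):
    """Draw n_per_seed binary comparisons for each random seed from the
    training set, mirroring the quoted protocol: 30 comparisons x 5 seeds
    = 150 test comparisons in total. Seed values are illustrative."""
    selected = []
    for seed in seeds:
        rng = random.Random(seed)  # independent generator per repetition
        selected.extend(rng.sample(train_comparisons, n_per_seed))
    return selected

# Toy usage with placeholder comparison pairs:
train = [(f"chosen_{i}", f"rejected_{i}") for i in range(1000)]
test_comparisons = sample_test_comparisons(train)
print(len(test_comparisons))  # 150
```

Each seed draws independently from the full training set, so the five repetitions may overlap; the paper does not state whether sampling was with or without replacement across seeds.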