Interpreting Language Reward Models via Contrastive Explanations

Authors: Junqi Jiang, Tom Bewley, Saumitra Mishra, Freddy Lecue, Manuela Veloso

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we demonstrate the effectiveness of our method for generating high-quality contrastive explanations. Our experiments are conducted on three open source human preference datasets and three RMs. For each dataset, we randomly select 30 binary comparisons from the training set serving as test comparisons (repeated five times with different random seeds, making 150 test comparisons in total), for which we then generate contrastive explanations using our method and the baselines for each RM. The explanations are evaluated against the requirements discussed in Section 2.3 using popular metrics from the text CF literature (Nguyen et al., 2024)."
Researcher Affiliation | Collaboration | Imperial College London; J.P. Morgan AI Research ({firstname.surname}@jpmorgan.com)
Pseudocode | No | "The paper describes methods using natural language and a visual overview in Figure 2, but does not include any explicitly labeled pseudocode or algorithm blocks."
Open Source Code | No | "The paper does not contain an explicit statement about releasing the source code for the methodology described, nor does it provide a direct link to a code repository for their implementation."
Open Datasets | Yes | "We use HelpSteer2 (HS2) (Wang et al., 2024b), HH-RLHF-helpful, and HH-RLHF-harmless (Bai et al., 2022)."
Dataset Splits | No | "For each dataset, we randomly select 30 binary comparisons from the training set serving as test comparisons (repeated five times with different random seeds, making 150 test comparisons in total), for which we then generate contrastive explanations using our method and the baselines for each RM."
Hardware Specification | No | "The paper does not provide specific details about the hardware (e.g., GPU models, CPU models, or cloud computing instances with specifications) used to conduct the experiments."
Software Dependencies | No | "The paper mentions several software components and models like GPT-4o, Sentence-BERT, and Polyjuice (Wu et al., 2021), but it does not specify explicit version numbers for these or other key software dependencies required to replicate the experiments."
Experiment Setup | Yes | "In our experiments, Y+ and Y− both contain 15 perturbed responses, each associated with one attribute from the following list: avoid-to-answer, appropriateness, assertiveness, clarity, coherence, complexity, correctness, engagement, harmlessness, helpfulness, informativeness, neutrality, relevance, sensitivity, verbosity. [...] For each dataset, we randomly select 30 binary comparisons from the training set serving as test comparisons (repeated five times with different random seeds, making 150 test comparisons in total), for which we then generate contrastive explanations using our method and the baselines for each RM."
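The sampling protocol quoted above (30 binary comparisons per seed, 5 seeds, 150 test comparisons in total, with 15 perturbation attributes per side) can be sketched as follows. This is a hedged illustration, not the authors' released code: the function names, the toy `training_set` structure, and the use of Python's `random` module are assumptions for demonstration only.

```python
import random

# The 15 perturbation attributes listed in the paper's experiment setup.
ATTRIBUTES = [
    "avoid-to-answer", "appropriateness", "assertiveness", "clarity",
    "coherence", "complexity", "correctness", "engagement", "harmlessness",
    "helpfulness", "informativeness", "neutrality", "relevance",
    "sensitivity", "verbosity",
]

NUM_COMPARISONS_PER_SEED = 30  # comparisons drawn per random seed
SEEDS = [0, 1, 2, 3, 4]        # 5 repetitions -> 150 comparisons total

def sample_test_comparisons(training_set, num=NUM_COMPARISONS_PER_SEED,
                            seeds=SEEDS):
    """Draw `num` binary comparisons per seed from the training set.

    Each element of `training_set` is a (chosen, rejected) response pair;
    a fresh seeded RNG per repetition makes each draw reproducible.
    """
    batches = []
    for seed in seeds:
        rng = random.Random(seed)
        batches.append(rng.sample(training_set, num))
    return batches

if __name__ == "__main__":
    # Toy training set standing in for one preference dataset.
    training_set = [(f"chosen_{i}", f"rejected_{i}") for i in range(1000)]
    batches = sample_test_comparisons(training_set)
    total = sum(len(batch) for batch in batches)
    print(len(ATTRIBUTES), total)  # 15 attributes, 150 test comparisons
```

In this sketch, each of the 150 sampled comparisons would then be perturbed once per attribute to build the Y+ and Y− sets of 15 responses each.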