RATE: Causal Explainability of Reward Models with Imperfect Counterfactuals

Authors: David Reber, Sean M Richardson, Todd Nief, Cristina Garbacea, Victor Veitch

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | There are two main questions to address empirically: (1) Does RATE correctly estimate the causal effect of attributes on reward models? (2) Is the distinction between RATE and the naive estimator actually substantive? Answering the first question requires knowing ground-truth causal effects. To this end, we design semi-synthetic experiments with known ground truth. In this setting, we find that RATE is effective at estimating the true effects, while the naive and single-rewrite estimators fail.
Researcher Affiliation | Academia | (1) Department of Computer Science, University of Chicago; (2) Department of Statistics, University of Chicago; (3) Data Science Institute, University of Chicago.
Pseudocode | Yes |
Algorithm 1 RATE: Rewrite-based Attribute Treatment Estimators
1: Input: Dataset {(x_i, y_i, w_i)}, reward model R, rewrite function Re(·)
2: Return: Estimates ATT_RATE, ATU_RATE, ATE_RATE
3: Initialize n_1 ← Σ_i I[w_i = 1], n_0 ← Σ_i I[w_i = 0]
4: ATT_RATE ← (1/n_1) Σ_{i: w_i=1} [ R(x_i, Re(Re(y_i, 0), 1)) − R(x_i, Re(y_i, 0)) ]
5: ATU_RATE ← (1/n_0) Σ_{i: w_i=0} [ R(x_i, Re(y_i, 1)) − R(x_i, Re(Re(y_i, 1), 0)) ]
6: ATE_RATE ← (n_1/(n_0 + n_1)) · ATT_RATE + (n_0/(n_0 + n_1)) · ATU_RATE
7: Return: ATT_RATE, ATU_RATE, ATE_RATE
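The algorithm above can be sketched in plain Python. The toy reward and rewrite functions in the usage example below are hypothetical stand-ins for a real reward model and an LLM rewriter; they exist only to make the sketch self-contained.

```python
def rate_estimates(samples, reward, rewrite):
    """Sketch of Algorithm 1 (RATE).

    samples: iterable of (x, y, w) triples, attribute label w in {0, 1}
    reward:  R(x, y) -> float
    rewrite: Re(y, w_target) -> y with the attribute set to w_target
    """
    att_terms, atu_terms = [], []
    for x, y, w in samples:
        if w == 1:
            # ATT: compare the rewrite-of-rewrite (attribute restored to 1)
            # against the single rewrite to 0, so that both sides carry the
            # same rewrite artifacts.
            att_terms.append(reward(x, rewrite(rewrite(y, 0), 1))
                             - reward(x, rewrite(y, 0)))
        else:
            # ATU: rewrite to 1 vs. rewrite-of-rewrite back to 0.
            atu_terms.append(reward(x, rewrite(y, 1))
                             - reward(x, rewrite(rewrite(y, 1), 0)))
    n1, n0 = len(att_terms), len(atu_terms)
    att = sum(att_terms) / n1 if n1 else 0.0
    atu = sum(atu_terms) / n0 if n0 else 0.0
    ate = (n1 * att + n0 * atu) / (n1 + n0)  # sample-weighted combination
    return att, atu, ate

# Toy usage: a hypothetical "reward" that counts exclamation marks and a
# hypothetical "rewrite" that toggles them (standing in for an LLM rewriter).
toy_reward = lambda x, y: float(y.count("!"))
toy_rewrite = lambda y, w: y.replace("!", "") + ("!!" if w == 1 else "")
data = [("prompt1", "great movie!!", 1), ("prompt2", "dull movie", 0)]
att, atu, ate = rate_estimates(data, toy_reward, toy_rewrite)
```

Note that both sides of each difference pass through the rewriter at least once, which is the point of the double-rewrite construction: systematic rewrite artifacts cancel instead of biasing the estimate.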
Open Source Code | Yes | Code is available at https://github.com/toddnief/RATE.
Open Datasets | Yes | The datasets used in our experiments (IMDB, ELI5, HelpSteer, HH-RLHF) are publicly available.
Dataset Splits | Yes | We induce this correlation by partitioning the IMDB dataset (Maas et al., 2011) into four categories: long positive, short positive, long negative, and short negative reviews. We then downsample each category, keeping the total number of samples constant (n = 9374) while increasing the correlation between length and positive sentiment (see Table 3 in Appendix C.2).
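The partition-and-downsample step can be sketched as follows. The quota parameterization (`rho_weight`) and the bucket sizes are hypothetical illustrations, not the paper's Table 3 values.

```python
import random

def downsample_with_correlation(buckets, total, rho_weight):
    """Downsample four (length, sentiment) buckets to a fixed total n.

    buckets: dict mapping ('long'|'short', 'pos'|'neg') -> list of reviews
    rho_weight in [0, 1]: 0 keeps the buckets balanced; 1 keeps only the
    length-sentiment-aligned buckets (a hypothetical parameterization).
    """
    aligned = {("long", "pos"), ("short", "neg")}
    base = total // 4
    out = {}
    for key, reviews in buckets.items():
        shift = int(rho_weight * base)
        # Oversample aligned buckets and undersample the rest by the same
        # amount, so the total stays constant while correlation grows.
        quota = base + shift if key in aligned else base - shift
        out[key] = random.sample(reviews, min(quota, len(reviews)))
    return out

# Toy usage with 100 synthetic reviews per bucket.
buckets = {k: [f"{k}-{i}" for i in range(100)]
           for k in [("long", "pos"), ("long", "neg"),
                     ("short", "pos"), ("short", "neg")]}
sampled = downsample_with_correlation(buckets, total=200, rho_weight=0.5)
```

Because the over- and undersampling shifts cancel, the total sample count is held fixed while the length-sentiment correlation is dialed up, which is what makes length a deliberate confound in the semi-synthetic experiments.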
Hardware Specification | No | The paper mentions using the 'OpenAI Batch API' and the 'gpt-4o-2024-08-06' model for generating rewrites, which are external services and models. It does not provide specific hardware details (such as GPU/CPU models or memory) used for running the experiments or for the authors' own model training and inference.
Software Dependencies | No | The paper mentions specific LLMs such as 'gpt-4o-2024-08-06' for rewrites and reward models such as 'FsfairX-LLaMA3-RM-v0.1' and a 'DistilBERT sentiment classifier'. However, it does not specify software dependencies with version numbers for the experimental environment itself (e.g., Python version, or specific library versions such as PyTorch or TensorFlow).
Experiment Setup | Yes | Setup: For all experiments, we use the OpenAI Batch API to generate rewrites of text, instructing the LLM to modify the target attribute without changing any other aspects of the response (see Appendix E.1). We use the gpt-4o-2024-08-06 model, incurring a cost of $1.25 per 1M input tokens and $5.00 per 1M output tokens. For instance, generating rewrites and rewrites-of-rewrites for 25K IMDB samples cost roughly $60. See Appendix E for additional implementation details and rewrite samples.
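The Batch API works by submitting a JSONL file in which each line is one chat-completion request. A minimal sketch of building one rewrite request line follows; the instruction wording is a hypothetical stand-in for the paper's actual prompts (given in its Appendix E.1).

```python
import json

def make_rewrite_request(sample_id, text, target_attribute, target_value):
    """Build one JSONL line for the OpenAI Batch API (/v1/chat/completions).

    The system-prompt wording here is illustrative only, not the paper's.
    """
    body = {
        "model": "gpt-4o-2024-08-06",
        "messages": [
            {"role": "system",
             "content": (f"Rewrite the text so that {target_attribute} is "
                         f"{target_value}, changing nothing else.")},
            {"role": "user", "content": text},
        ],
    }
    # Batch API lines carry a custom_id (to match responses back to inputs),
    # the HTTP method, the endpoint URL, and the request body.
    return json.dumps({
        "custom_id": sample_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": body,
    })

line = make_rewrite_request("imdb-0", "Great film.", "sentiment", "negative")
```

Writing one such line per sample (and a second pass over the returned rewrites for the rewrites-of-rewrites) yields the JSONL files the Batch API consumes at its discounted per-token rates.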