RATE: Causal Explainability of Reward Models with Imperfect Counterfactuals

Authors: David Reber, Sean M Richardson, Todd Nief, Cristina Garbacea, Victor Veitch

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | There are two main questions to address empirically: (1) Does RATE correctly estimate the causal effect of attributes on reward models? (2) Is the distinction between RATE and the naive estimator actually substantive? Answering the first question requires knowing ground-truth causal effects. To this end, we design semi-synthetic experiments with known ground truth. In this setting, we find that RATE is effective at estimating the true effects, while the naive and single-rewrite estimators fail.
Researcher Affiliation | Academia | (1) Department of Computer Science, University of Chicago; (2) Department of Statistics, University of Chicago; (3) Data Science Institute, University of Chicago.
Pseudocode | Yes |
Algorithm 1 RATE: Rewrite-based Attribute Treatment Estimators
1: Input: Dataset {(x_i, y_i, w_i)}, reward model R, rewrite function Re(·)
2: Return: Estimates ATT_RATE, ATU_RATE, ATE_RATE
3: Initialize n_1 ← Σ_i I[w_i = 1], n_0 ← Σ_i I[w_i = 0]
4: ATT_RATE ← (1/n_1) Σ_{i: w_i=1} [ R(x_i, Re(Re(y_i, 0), 1)) − R(x_i, Re(y_i, 0)) ]
5: ATU_RATE ← (1/n_0) Σ_{i: w_i=0} [ R(x_i, Re(y_i, 1)) − R(x_i, Re(Re(y_i, 1), 0)) ]
6: ATE_RATE ← (n_1/(n_0 + n_1)) · ATT_RATE + (n_0/(n_0 + n_1)) · ATU_RATE
7: Return: ATT_RATE, ATU_RATE, ATE_RATE
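The algorithm above can be sketched in plain Python. The toy reward and rewrite functions in the usage example below are hypothetical stand-ins for a real reward model and an LLM rewriter; they exist only to make the sketch self-contained.

```python
def rate_estimates(samples, reward, rewrite):
    """Sketch of Algorithm 1 (RATE).

    samples: iterable of (x, y, w) triples, attribute label w in {0, 1}
    reward:  R(x, y) -> float
    rewrite: Re(y, w_target) -> y with the attribute set to w_target
    """
    att_terms, atu_terms = [], []
    for x, y, w in samples:
        if w == 1:
            # ATT: compare the rewrite-of-rewrite (attribute restored to 1)
            # against the single rewrite to 0, so that both sides carry the
            # same rewrite artifacts.
            att_terms.append(reward(x, rewrite(rewrite(y, 0), 1))
                             - reward(x, rewrite(y, 0)))
        else:
            # ATU: rewrite to 1 vs. rewrite-of-rewrite back to 0.
            atu_terms.append(reward(x, rewrite(y, 1))
                             - reward(x, rewrite(rewrite(y, 1), 0)))
    n1, n0 = len(att_terms), len(atu_terms)
    att = sum(att_terms) / n1 if n1 else 0.0
    atu = sum(atu_terms) / n0 if n0 else 0.0
    ate = (n1 * att + n0 * atu) / (n1 + n0)  # sample-weighted combination
    return att, atu, ate

# Toy usage: a hypothetical "reward" that counts exclamation marks and a
# hypothetical "rewrite" that toggles them (standing in for an LLM rewriter).
toy_reward = lambda x, y: float(y.count("!"))
toy_rewrite = lambda y, w: y.replace("!", "") + ("!!" if w == 1 else "")
data = [("prompt1", "great movie!!", 1), ("prompt2", "dull movie", 0)]
att, atu, ate = rate_estimates(data, toy_reward, toy_rewrite)
```

Note that both sides of each difference pass through the rewriter at least once, which is the point of the double-rewrite construction: systematic rewrite artifacts cancel instead of biasing the estimate.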
Open Source Code | Yes | Code is available at https://github.com/toddnief/RATE.
Open Datasets | Yes | The datasets used in our experiments (IMDB, ELI5, HelpSteer, HH-RLHF) are publicly available.
Dataset Splits | Yes | We induce this correlation by partitioning the IMDB dataset (Maas et al., 2011) into four categories: long positive, short positive, long negative, and short negative reviews. We then downsample each category, keeping the total number of samples constant (n = 9374) while increasing the correlation between length and positive sentiment (see Table 3 in Appendix C.2).
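The partition-and-downsample step can be sketched as follows. The quota parameterization (`rho_weight`) and the bucket sizes are hypothetical illustrations, not the paper's Table 3 values.

```python
import random

def downsample_with_correlation(buckets, total, rho_weight):
    """Downsample four (length, sentiment) buckets to a fixed total n.

    buckets: dict mapping ('long'|'short', 'pos'|'neg') -> list of reviews
    rho_weight in [0, 1]: 0 keeps the buckets balanced; 1 keeps only the
    length-sentiment-aligned buckets (a hypothetical parameterization).
    """
    aligned = {("long", "pos"), ("short", "neg")}
    base = total // 4
    out = {}
    for key, reviews in buckets.items():
        shift = int(rho_weight * base)
        # Oversample aligned buckets and undersample the rest by the same
        # amount, so the total stays constant while correlation grows.
        quota = base + shift if key in aligned else base - shift
        out[key] = random.sample(reviews, min(quota, len(reviews)))
    return out

# Toy usage with 100 synthetic reviews per bucket.
buckets = {k: [f"{k}-{i}" for i in range(100)]
           for k in [("long", "pos"), ("long", "neg"),
                     ("short", "pos"), ("short", "neg")]}
sampled = downsample_with_correlation(buckets, total=200, rho_weight=0.5)
```

Because the over- and undersampling shifts cancel, the total sample count is held fixed while the length-sentiment correlation is dialed up, which is what makes length a deliberate confound in the semi-synthetic experiments.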
Hardware Specification | No | The paper mentions using the 'OpenAI Batch API' and the 'gpt-4o-2024-08-06' model for generating rewrites, which are external services and models. It does not provide specific hardware details (such as GPU/CPU models or memory) used for running the experiments or for the authors' own model training and inference.
Software Dependencies | No | The paper mentions specific LLMs such as 'gpt-4o-2024-08-06' for rewrites and reward models such as 'FsfairX-LLaMA3-RM-v0.1' and a 'DistilBERT sentiment classifier'. However, it does not specify software dependencies with version numbers for the experimental environment itself (e.g., Python version, or specific library versions such as PyTorch or TensorFlow).
Experiment Setup | Yes | Setup: For all experiments, we use the OpenAI Batch API to generate rewrites of text, instructing the LLM to modify the target attribute without changing any other aspects of the response (see Appendix E.1). We use the gpt-4o-2024-08-06 model, incurring a cost of $1.25 per 1M input tokens and $5.00 per 1M output tokens. For instance, generating rewrites and rewrites-of-rewrites for 25K IMDB samples cost roughly $60. See Appendix E for additional implementation details and rewrite samples.
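The Batch API works by submitting a JSONL file in which each line is one chat-completion request. A minimal sketch of building one rewrite request line follows; the instruction wording is a hypothetical stand-in for the paper's actual prompts (given in its Appendix E.1).

```python
import json

def make_rewrite_request(sample_id, text, target_attribute, target_value):
    """Build one JSONL line for the OpenAI Batch API (/v1/chat/completions).

    The system-prompt wording here is illustrative only, not the paper's.
    """
    body = {
        "model": "gpt-4o-2024-08-06",
        "messages": [
            {"role": "system",
             "content": (f"Rewrite the text so that {target_attribute} is "
                         f"{target_value}, changing nothing else.")},
            {"role": "user", "content": text},
        ],
    }
    # Batch API lines carry a custom_id (to match responses back to inputs),
    # the HTTP method, the endpoint URL, and the request body.
    return json.dumps({
        "custom_id": sample_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": body,
    })

line = make_rewrite_request("imdb-0", "Great film.", "sentiment", "negative")
```

Writing one such line per sample (and a second pass over the returned rewrites for the rewrites-of-rewrites) yields the JSONL files the Batch API consumes at its discounted per-token rates.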