RATE: Causal Explainability of Reward Models with Imperfect Counterfactuals
Authors: David Reber, Sean M Richardson, Todd Nief, Cristina Garbacea, Victor Veitch
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | There are two main questions to address empirically: 1. Does RATE correctly estimate the causal effect of attributes on reward models? 2. Is the distinction between RATE and the naive estimator actually substantive? Answering the first question requires knowing ground truth causal effects. To this end, we design semi-synthetic experiments with known ground truth. In this setting, we find that RATE is effective at estimating the true effects, while the naive and single-rewrite estimators fail. |
| Researcher Affiliation | Academia | 1 Department of Computer Science, University of Chicago 2 Department of Statistics, University of Chicago 3 Data Science Institute, University of Chicago. |
| Pseudocode | Yes | Algorithm 1 (RATE: Rewrite-based Attribute Treatment Estimators). Input: dataset {(x_i, y_i, w_i)}, reward model R, rewrite function Re(·). Output: estimates ATT_RATE, ATU_RATE, ATE_RATE. Initialize n_1 = Σ_i 1[w_i = 1] and n_0 = Σ_i 1[w_i = 0]. ATT_RATE = (1/n_1) Σ_{i: w_i=1} [R(x_i, Re(Re(y_i, 0), 1)) − R(x_i, Re(y_i, 0))]. ATU_RATE = (1/n_0) Σ_{i: w_i=0} [R(x_i, Re(y_i, 1)) − R(x_i, Re(Re(y_i, 1), 0))]. ATE_RATE = (n_1/(n_0+n_1)) · ATT_RATE + (n_0/(n_0+n_1)) · ATU_RATE. Return ATT_RATE, ATU_RATE, ATE_RATE. |
| Open Source Code | Yes | Code is available at https://github.com/toddnief/RATE. |
| Open Datasets | Yes | The datasets used in our experiments (IMDB, ELI5, HelpSteer, HH-RLHF) are publicly available. |
| Dataset Splits | Yes | We induce this correlation by partitioning the IMDB dataset (Maas et al., 2011) into four categories: long positive, short positive, long negative, and short negative reviews. We then downsample each category, keeping the total number of samples constant (n = 9374) while increasing the correlation between length and positive sentiment (see Table 3 in Appendix C.2). |
| Hardware Specification | No | The paper mentions using the OpenAI Batch API and the gpt-4o-2024-08-06 model for generating rewrites, which are external services and models. It does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments or for the authors' own model inference. |
| Software Dependencies | No | The paper mentions using specific LLMs such as gpt-4o-2024-08-06 for rewrites, and reward models such as FsfairX-LLaMA3-RM-v0.1 and a DistilBERT sentiment classifier. However, it does not specify software dependencies with version numbers for the experimental environment itself (e.g., Python version, or specific library versions such as PyTorch or TensorFlow). |
| Experiment Setup | Yes | Setup For all experiments, we use the OpenAI Batch API to generate rewrites of text, instructing the LLM to modify the target attribute without changing any other aspects of the response (see Appendix E.1). We use the gpt-4o-2024-08-06 model, incurring a cost of $1.25 per 1M input tokens and $5.00 per 1M output tokens. For instance, generating rewrites and rewrites-of-rewrites for 25K IMDB samples cost roughly $60. See Appendix E for additional implementation details and rewrite samples. |
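The pseudocode row above (Algorithm 1) can be sketched in Python. This is a hypothetical illustration, not the authors' released code: `reward(x, y)` stands in for the reward model R, and `rewrite(y, w)` stands in for the LLM-based rewrite operation Re(y, w) that flips the target attribute to value w. In the paper the key point is that both arms of each contrast use (at least one) rewrite, so rewrite artifacts cancel.

```python
def rate_estimates(dataset, reward, rewrite):
    """Estimate ATT, ATU, and ATE per Algorithm 1 (RATE).

    dataset: iterable of (x, y, w) triples with binary attribute w in {0, 1}.
    reward:  callable (x, y) -> float, the reward model R.
    rewrite: callable (y, w) -> y', rewrites y so the attribute takes value w.
    """
    att_terms, atu_terms = [], []
    for x, y, w in dataset:
        if w == 1:
            # Treated samples: compare rewrite-of-rewrite (back to w=1)
            # against the single rewrite to w=0.
            y0 = rewrite(y, 0)
            att_terms.append(reward(x, rewrite(y0, 1)) - reward(x, y0))
        else:
            # Untreated samples: compare the rewrite to w=1 against
            # the rewrite-of-rewrite back to w=0.
            y1 = rewrite(y, 1)
            atu_terms.append(reward(x, y1) - reward(x, rewrite(y1, 0)))
    n1, n0 = len(att_terms), len(atu_terms)
    att = sum(att_terms) / n1 if n1 else 0.0
    atu = sum(atu_terms) / n0 if n0 else 0.0
    # ATE is the sample-weighted average of the ATT and ATU estimates.
    ate = (n1 * att + n0 * atu) / (n1 + n0)
    return att, atu, ate
```

In practice each `rewrite` call is an LLM request (gpt-4o in the paper), so the rewrites and rewrites-of-rewrites would be generated once in a batch and cached before scoring.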
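The dataset-splits row describes inducing a length/sentiment correlation by partitioning IMDB into four buckets and downsampling while holding the total n fixed. A minimal sketch of that resampling scheme follows; the bucket fractions and the `frac_offdiag` knob are assumptions for illustration (the paper's exact per-correlation counts are in its Table 3, Appendix C.2).

```python
import random

def induce_correlation(samples, n_total, frac_offdiag, seed=0):
    """Downsample to n_total items while controlling the length/sentiment
    correlation.

    samples: list of (text, is_long, is_positive) with binary flags.
    frac_offdiag: fraction of the kept data drawn from the 'off-diagonal'
        buckets (long-negative and short-positive); smaller values mean a
        stronger positive correlation between length and sentiment.
    """
    rng = random.Random(seed)
    buckets = {(l, p): [] for l in (0, 1) for p in (0, 1)}
    for s in samples:
        buckets[(s[1], s[2])].append(s)
    n_off = int(n_total * frac_offdiag / 2)   # per off-diagonal bucket
    n_diag = (n_total - 2 * n_off) // 2       # per diagonal bucket
    kept = []
    for (l, p), items in buckets.items():
        k = n_diag if l == p else n_off
        kept.extend(rng.sample(items, min(k, len(items))))
    return kept
```

Sweeping `frac_offdiag` from 0.5 (no correlation) toward 0 reproduces the qualitative setup: total sample size stays constant while the confounding between length and positive sentiment grows.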