Preference Learning for AI Alignment: A Causal Perspective
Authors: Kasia Kobalczyk, Mihaela van der Schaar
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment (Limited latent positivity). We illustrate the significance of latent positivity by examining its influence on standard BTL models. Using the Ultra Feedback dataset (Cui et al., 2024), we consider the truthfulness and instruction-following factors of each prompt-response pair, denoted Z1 and Z2, respectively, and scored from 0 to 5. We construct five training datasets by varying the correlation coefficient ρtr between Z1 and Z2. We let the reward function be r(x, y) = 1/4z^2, with the true values of z1 and z2 not available for training. The datasets consist of tuples (x, y, y', ℓ) with ℓ determined by the function r. We assess the robustness of the learned reward models to shifts in the correlation of Z1 and Z2, testing on previously unseen examples either from the same distribution as the training examples (ID: ρtest = ρtr) or not, with ρtest being negative (OOD: ρtest < 0). Refer to Appendix D.2 for details. |
| Researcher Affiliation | Academia | Katarzyna Kobalczyk¹, Mihaela van der Schaar¹. ¹Department of Applied Mathematics and Theoretical Physics, University of Cambridge, United Kingdom. Correspondence to: Katarzyna Kobalczyk <EMAIL>. |
| Pseudocode | No | The paper describes methods and models through narrative text, mathematical equations, and figures (like diagrams of model architectures), but it does not contain any explicitly labeled "Pseudocode" or "Algorithm" blocks or structured, step-by-step procedures formatted like code. |
| Open Source Code | Yes | Code for reproducing the experiments is made available at: https://github.com/kasia-kobalczyk/causal-preference-learning. |
| Open Datasets | Yes | Using the Ultra Feedback dataset (Cui et al., 2024), we consider the truthfulness and instruction following factors of each prompt-response pair, denoted Z1 and Z2, respectively, and scored from 0 to 5. We rely on the extended version of the HH-RLHF dataset (Bai et al., 2022) as provided by Siththaranjan et al. (2024). |
| Dataset Splits | Yes | D.2. Experimental Details. Datasets: We create a large dataset of candidate pairs (x, y, y', z, z', ℓ), where z and z' are 2-dimensional vectors corresponding to the truthfulness and instruction-following scores of the candidate options (x, y) and (x, y'), respectively. Training: With stratified sampling based on the value of z − z', we create 4 training datasets with 15,000 samples, one for each value of ρtr ∈ {0.0, 0.3, 0.6, 0.9}. The resulting training sets contain tuples (x, y, y', ℓ); the ground-truth values of the latent factors are not available during training and need to be approximated in an unsupervised fashion. Validation: Validation splits, used to determine the optimal stopping point during training of the reward models, are sampled according to the same distribution as the training datasets and contain 2,000 samples. Testing: For each value of ρtr we also create a testing dataset with a matching value of the correlation coefficient (ID), and a testing dataset with only samples falling into the second and fourth quadrants of the (δ1, δ2)-plane, resulting in a correlation coefficient of −0.8 (OOD). The testing datasets are always disjoint from the training examples and contain 15,000 samples. E.1. Experimental Details. Dataset: We rely on the extended version of the HH-RLHF dataset (Bai et al., 2022) as provided by Siththaranjan et al. (2024). The entire dataset can be represented as tuples (t, x, y, y', c, ℓ), where c ∈ {0, 1} denotes the objective with which the choice ℓ ∈ {0, 1} is made and t ∈ {0, 1} denotes the type of x, i.e. whether (x, y, y') was originally part of the helpful (t = 0) or harmless (t = 1) split. In the original data (Bai et al., 2022) we only observe examples with t = c. Siththaranjan et al. (2024) augment this dataset with counterfactual labels for samples with t ≠ c, which we refer to as inconsistent samples. We create six independent training datasets, controlling the ratio of consistent to inconsistent samples, i.e. the parameter ρ = P(type(X) = C) ∈ {0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. The resulting training datasets consist of 30,000 samples (x, y, y', c, ℓ), with the label t not being part of the training sets. We also create validation splits with the same values of ρ, each of 6,000 samples. The remaining 46,518 samples are left for testing. |
| Hardware Specification | No | The paper mentions using embeddings from the Llama-3-8B* model and performing training, but does not provide any specific details about the hardware used, such as GPU models, CPU types, or cloud computing instance specifications. |
| Software Dependencies | No | The paper mentions using the Adam optimiser, GELU activation functions, and the Llama-3-8B* model for embeddings, but it does not specify any software libraries or frameworks with version numbers (e.g., PyTorch version, TensorFlow version, Hugging Face Transformers version). |
| Experiment Setup | Yes | Reward model training. ... All models are trained using the Adam optimiser with a learning rate of 1e-4 for 10 epochs. Model weights θ with the highest validation accuracy are saved for evaluation. Adversarial: ... λ is a hyperparameter balancing the two objectives. ... which in our experiments we set to 1.0. |
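The OOD test construction quoted above (keeping only samples in the second and fourth quadrants of the (δ1, δ2)-plane to induce a negative correlation near −0.8) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the pool size, random scores, and variable names are assumptions, with factor scores drawn uniformly from 0–5 as in the UltraFeedback description.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative pool of candidate pairs: each row holds the two latent
# factor scores (truthfulness, instruction-following) in 0..5 for the
# two responses y and y' to the same prompt.
n = 50_000
z = rng.integers(0, 6, size=(n, 2))        # scores of (x, y)
z_prime = rng.integers(0, 6, size=(n, 2))  # scores of (x, y')
delta = z - z_prime                         # (delta1, delta2) per pair

# OOD split: keep only pairs where the two factor differences disagree
# in sign (second and fourth quadrants), which forces a strongly
# negative correlation between delta1 and delta2.
mask_ood = (delta[:, 0] * delta[:, 1]) < 0
delta_ood = delta[mask_ood]
rho_ood = np.corrcoef(delta_ood[:, 0], delta_ood[:, 1])[0, 1]
print(round(rho_ood, 2))  # strongly negative, near the paper's -0.8
```

With uniform scores this quadrant filter lands close to the −0.8 coefficient the appendix reports; the paper additionally uses stratified sampling on z − z' to hit each target ρtr exactly, which is omitted here for brevity.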
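The quoted training setup (a BTL reward model trained with Adam at learning rate 1e-4 for 10 epochs, with GELU activations over fixed Llama-3-8B embeddings) can be sketched as below. This is a minimal sketch under stated assumptions: the MLP head architecture, hidden size, and random toy embeddings are illustrative and not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Small MLP mapping a fixed response embedding to a scalar reward.
    The two-layer shape and hidden width are assumptions for illustration."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),               # GELU activations, as mentioned in the paper
            nn.Linear(hidden, 1),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.net(emb).squeeze(-1)

def btl_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry-Luce: P(chosen > rejected) = sigmoid(r_chosen - r_rejected),
    # trained by maximising the log-likelihood of the observed preferences.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy stand-ins for precomputed response embeddings.
torch.manual_seed(0)
dim = 16
emb_chosen = torch.randn(128, dim)
emb_rejected = torch.randn(128, dim) - 0.5  # rejected responses shifted down

model = RewardHead(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr matches the reported setup
for epoch in range(10):                              # 10 epochs, as in the paper
    opt.zero_grad()
    loss = btl_loss(model(emb_chosen), model(emb_rejected))
    loss.backward()
    opt.step()
```

The paper additionally saves the weights with the highest validation accuracy and, in the adversarial variant, adds a second objective weighted by λ = 1.0; both are omitted from this sketch.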