Strong Preferences Affect the Robustness of Preference Models and Value Alignment
Authors: Ziwei Xu, Mohan Kankanhalli
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we investigate the robustness of value alignment by examining the sensitivity of preference models. Specifically, we ask: how do changes in the probabilities of some preferences affect the predictions of these models for other preferences? To answer this question, we theoretically analyze the robustness of widely used preference models by examining their sensitivities to minor changes in the preferences they model. Our findings reveal that, in the Bradley-Terry and the Plackett-Luce model, the probability of a preference can change significantly as other preferences change, especially when these preferences are dominant (i.e., with probabilities near 0 or 1). We identify specific conditions where this sensitivity becomes significant for these models and discuss the practical implications for the robustness and safety of value alignment in AI systems. ... Towards this goal, we use a set of three options Oa = {dog, cat, bird} to synthesize a series of datasets that contain pairwise preferences about animals in Oa, with a controllable distribution of preferences. ... The result is illustrated in Fig. 3, from which two observations can be drawn. First, the preference of the trained model exhibits a significant shift in learned probabilities (from near 0 to near 1) despite comparatively minor changes in the distribution of training samples. |
| Researcher Affiliation | Academia | Ziwei Xu, Department of Computer Science, National University of Singapore; Mohan Kankanhalli, Department of Computer Science, National University of Singapore |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. The methodology is presented through mathematical derivations and theoretical analysis. |
| Open Source Code | No | The paper provides no explicit statement or link to open-source code for its methodology. It mentions using third-party LLMs such as Llama-3-8B-Instruct and zephyr-7b-alpha. |
| Open Datasets | Yes | We study two reward models nvidia/Llama-3.1-Nemotron-70B-Reward-HF (Hugging Face, e) and OpenAssistant/reward-model-deberta-v3-large-v2 (Hugging Face, a) under the RLHF framework. These models are studied on the test split of Anthropic/hh-rlhf (Ganguli et al., 2022). |
| Dataset Splits | Yes | We study two reward models nvidia/Llama-3.1-Nemotron-70B-Reward-HF (Hugging Face, e) and OpenAssistant/reward-model-deberta-v3-large-v2 (Hugging Face, a) under the RLHF framework. These models are studied on the test split of Anthropic/hh-rlhf (Ganguli et al., 2022). |
| Hardware Specification | Yes | Training and inference of large language models use two NVIDIA A100 GPUs, each with 40 gigabytes of video memory. |
| Software Dependencies | No | An 8-bit version of the AdamW optimizer (Loshchilov & Hutter, 2019) provided by Hugging Face's bitsandbytes package (Hugging Face, b) is used to train the LLMs. ... In each experiment, a Llama-3-8B-Instruct (Hugging Face, d) model is trained on one of the datasets using the DPO algorithm; training is repeated three times with different random seeds. |
| Experiment Setup | Yes | In each experiment session, an LLM is trained using DPO for one epoch with the learning rate set to 5e-6 and the temperature β of the DPO loss set to 0.1. |
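The Bradley-Terry sensitivity claim quoted under Research Type can be illustrated with a short sketch (this is not code from the paper; the functions below are hypothetical helpers). Under the BT model with scores s_a, s_b, s_c, pairwise probabilities are sigmoids of score differences, so the probability P(a ≻ c) implied by P(a ≻ b) and P(b ≻ c) is σ(logit(P(a ≻ b)) + logit(P(b ≻ c))), and its derivative with respect to either input grows without bound as that input approaches 0 or 1, i.e., as the preference becomes dominant:

```python
import math

def logit(p):
    """Inverse sigmoid: maps a probability to a BT score difference."""
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def implied_pref(p_ab, p_bc):
    """BT-implied P(a > c) given P(a > b) and P(b > c)."""
    return sigmoid(logit(p_ab) + logit(p_bc))

def sensitivity(p_ab, p_bc):
    """d P(a > c) / d P(a > b).

    The 1 / (p_ab * (1 - p_ab)) factor blows up as p_ab nears 0 or 1,
    matching the paper's claim that dominant preferences make the
    model's other predictions highly sensitive.
    """
    p_ac = implied_pref(p_ab, p_bc)
    return p_ac * (1 - p_ac) / (p_ab * (1 - p_ab))
```

For example, `sensitivity(0.5, 0.5)` is 1.0, while `sensitivity(0.99, 0.01)` exceeds 25: near the boundary, a small nudge to one preference probability corresponds to a large score change, which propagates to other pairwise predictions.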
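For reference, the β in the Experiment Setup row is the temperature of the standard DPO objective (Rafailov et al., 2023), which the excerpt does not restate. With policy $\pi_\theta$, frozen reference model $\pi_{\text{ref}}$, and a preferred/rejected response pair $(y_w, y_l)$ for prompt $x$:

```latex
\mathcal{L}_{\text{DPO}}(\theta)
  = -\log \sigma\!\left(
      \beta \left[
        \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
      \right]
    \right)
```

A smaller β (here 0.1) weakens the implicit KL pull toward the reference model, which is relevant to the paper's observation that small shifts in the training preference distribution can move learned probabilities from near 0 to near 1.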