Strong Preferences Affect the Robustness of Preference Models and Value Alignment

Authors: Ziwei Xu, Mohan Kankanhalli

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. In this paper, we investigate the robustness of value alignment by examining the sensitivity of preference models. Specifically, we ask: how do changes in the probabilities of some preferences affect the predictions of these models for other preferences? To answer this question, we theoretically analyze the robustness of widely used preference models by examining their sensitivities to minor changes in the preferences they model. Our findings reveal that, in the Bradley-Terry and the Plackett-Luce model, the probability of a preference can change significantly as other preferences change, especially when these preferences are dominant (i.e., with probabilities near 0 or 1). We identify specific conditions where this sensitivity becomes significant for these models and discuss the practical implications for the robustness and safety of value alignment in AI systems. ... Towards this goal, we use a set of three options Oa = {dog, cat, bird} to synthesize a series of datasets that contain pairwise preferences about animals in Oa, with a controllable distribution of preferences. ... The result is illustrated in Fig. 3, from which two observations can be drawn. First, the preferences of the trained model exhibit a significant shift in learned probabilities (from near 0 to near 1) despite comparatively minor changes in the distribution of training samples.
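The sensitivity quoted above can be illustrated with a small numeric sketch (not from the paper; the option names and probability values are illustrative). Under the Bradley-Terry model, P(a ≻ c) is implied by P(a ≻ b) and P(b ≻ c) through shared latent scores, so near-dominant preferences make the implied probability highly sensitive:

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def implied_pref(p_ab, p_bc):
    """Bradley-Terry: with scores s_a, s_b, s_c, P(x > y) = sigmoid(s_x - s_y),
    so P(a > c) is determined by P(a > b) and P(b > c) via additive log-odds."""
    return sigmoid(logit(p_ab) + logit(p_bc))

# Two strong, opposing preferences: P(dog > cat) = 0.99, P(cat > bird) = 0.01.
# Their log-odds cancel exactly, so the implied P(dog > bird) is 0.5.
base = implied_pref(0.99, 0.01)

# Nudging P(dog > cat) by less than 0.01 (0.99 -> 0.999) moves the implied
# probability from 0.5 to above 0.9 -- a large swing from a minor change.
perturbed = implied_pref(0.999, 0.01)
```

This is only a toy consistency argument, but it matches the paper's qualitative claim that dominant preferences (probabilities near 0 or 1) make other modeled preferences unstable.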
Researcher Affiliation: Academia. Ziwei Xu, Department of Computer Science, National University of Singapore (EMAIL); Mohan Kankanhalli, Department of Computer Science, National University of Singapore (EMAIL).
Pseudocode: No. The paper does not contain any explicitly labeled pseudocode or algorithm blocks; the methodology is presented through mathematical derivations and theoretical analysis.
Open Source Code: No. No explicit statement or link providing open-source code for the described methodology is found. The paper mentions using third-party LLMs such as Llama-3-8B-Instruct and zephyr-7b-alpha.
Open Datasets: Yes. We study two reward models, nvidia/Llama-3.1-Nemotron-70B-Reward-HF (Hugging Face, e) and OpenAssistant/reward-model-deberta-v3-large-v2 (Hugging Face, a), under the RLHF framework. These models are studied on the test split of Anthropic/hh-rlhf (Ganguli et al., 2022).
Dataset Splits: Yes. We study two reward models, nvidia/Llama-3.1-Nemotron-70B-Reward-HF (Hugging Face, e) and OpenAssistant/reward-model-deberta-v3-large-v2 (Hugging Face, a), under the RLHF framework. These models are studied on the test split of Anthropic/hh-rlhf (Ganguli et al., 2022).
Hardware Specification: Yes. Training and inference of large language models use two NVIDIA A100 GPUs, each with 40 gigabytes of video memory.
Software Dependencies: No. An 8-bit version of the AdamW optimizer (Loshchilov & Hutter, 2019) provided by Hugging Face's bitsandbytes package (Hugging Face, b) is used to train the LLMs. ... In each experiment, a Llama-3-8B-Instruct (Hugging Face, d) model is trained on one of the datasets using the DPO algorithm; training is repeated three times with different random seeds.
Experiment Setup: Yes. In each experiment session, an LLM is trained using DPO for one epoch with the learning rate set to 5e-6 and the DPO loss temperature β set to 0.1.
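As a minimal sketch of the objective these settings plug into, assuming the standard per-example DPO loss (Rafailov et al., 2023) with β = 0.1 as stated above; the log-probability values below are illustrative, not from the paper:

```python
import math

def dpo_loss(pol_logp_w, pol_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * margin), where the margin is
    the policy's log-prob advantage of the chosen response over the rejected
    one, measured relative to the frozen reference model."""
    margin = (pol_logp_w - ref_logp_w) - (pol_logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# At initialization the policy equals the reference, so the margin is 0 and
# the loss is log(2). beta controls how sharply the loss reacts to the margin.
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Once the policy favors the chosen response relative to the reference
# (margin +3 here), the loss drops below log(2).
improved = dpo_loss(-10.0, -16.0, -12.0, -15.0)
```

A small β such as 0.1 keeps the implicit reward soft, so the trained policy stays close to the reference model, which is the usual rationale for this hyperparameter choice.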