Strong Preferences Affect the Robustness of Preference Models and Value Alignment

Authors: Ziwei Xu, Mohan Kankanhalli

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. In this paper, we investigate the robustness of value alignment by examining the sensitivity of preference models. Specifically, we ask: how do changes in the probabilities of some preferences affect the predictions of these models for other preferences? To answer this question, we theoretically analyze the robustness of widely used preference models by examining their sensitivities to minor changes in the preferences they model. Our findings reveal that, in the Bradley-Terry and the Plackett-Luce model, the probability of a preference can change significantly as other preferences change, especially when these preferences are dominant (i.e., with probabilities near 0 or 1). We identify specific conditions where this sensitivity becomes significant for these models and discuss the practical implications for the robustness and safety of value alignment in AI systems. ... Towards this goal, we use a set of three options Oa = {dog, cat, bird} to synthesize a series of datasets that contain pairwise preferences about animals in Oa, with a controllable distribution of preferences. ... The result is illustrated in Fig. 3, from which two observations can be drawn. First, the preferences of the trained model exhibit a significant shift in learned probabilities (from near 0 to near 1) despite comparatively minor changes in the distribution of training samples.
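The sensitivity quoted above can be illustrated with a small numeric sketch (not from the paper; the option names and probability values are illustrative). Under the Bradley-Terry model, P(a ≻ c) is implied by P(a ≻ b) and P(b ≻ c) through shared latent scores, so near-dominant preferences make the implied probability highly sensitive:

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def implied_pref(p_ab, p_bc):
    """Bradley-Terry: with scores s_a, s_b, s_c, P(x > y) = sigmoid(s_x - s_y),
    so P(a > c) is determined by P(a > b) and P(b > c) via additive log-odds."""
    return sigmoid(logit(p_ab) + logit(p_bc))

# Two strong, opposing preferences: P(dog > cat) = 0.99, P(cat > bird) = 0.01.
# Their log-odds cancel exactly, so the implied P(dog > bird) is 0.5.
base = implied_pref(0.99, 0.01)

# Nudging P(dog > cat) by less than 0.01 (0.99 -> 0.999) moves the implied
# probability from 0.5 to above 0.9 -- a large swing from a minor change.
perturbed = implied_pref(0.999, 0.01)
```

This is only a toy consistency argument, but it matches the paper's qualitative claim that dominant preferences (probabilities near 0 or 1) make other modeled preferences unstable.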
Researcher Affiliation: Academia. Ziwei Xu, Department of Computer Science, National University of Singapore (EMAIL); Mohan Kankanhalli, Department of Computer Science, National University of Singapore (EMAIL).
Pseudocode: No. The paper does not contain any explicitly labeled pseudocode or algorithm blocks; the methodology is presented through mathematical derivations and theoretical analysis.
Open Source Code: No. No explicit statement or link providing open-source code for the described methodology is found. The paper mentions using third-party LLMs such as Llama-3-8B-Instruct and zephyr-7b-alpha.
Open Datasets: Yes. We study two reward models, nvidia/Llama-3.1-Nemotron-70B-Reward-HF (Hugging Face, e) and OpenAssistant/reward-model-deberta-v3-large-v2 (Hugging Face, a), under the RLHF framework. These models are studied on the test split of Anthropic/hh-rlhf (Ganguli et al., 2022).
Dataset Splits: Yes. We study two reward models, nvidia/Llama-3.1-Nemotron-70B-Reward-HF (Hugging Face, e) and OpenAssistant/reward-model-deberta-v3-large-v2 (Hugging Face, a), under the RLHF framework. These models are studied on the test split of Anthropic/hh-rlhf (Ganguli et al., 2022).
Hardware Specification: Yes. Training and inference of large language models use two NVIDIA A100 GPUs, each with 40 gigabytes of video memory.
Software Dependencies: No. An 8-bit version of the AdamW optimizer (Loshchilov & Hutter, 2019) provided by Hugging Face's bitsandbytes package (Hugging Face, b) is used to train the LLMs. ... In each experiment, a Llama-3-8B-Instruct (Hugging Face, d) model is trained on one of the datasets using the DPO algorithm; training is repeated three times with different random seeds.
Experiment Setup: Yes. In each experiment session, an LLM is trained using DPO for one epoch with the learning rate set to 5e-6 and the DPO loss temperature β set to 0.1.
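As a minimal sketch of the objective these settings plug into, assuming the standard per-example DPO loss (Rafailov et al., 2023) with β = 0.1 as stated above; the log-probability values below are illustrative, not from the paper:

```python
import math

def dpo_loss(pol_logp_w, pol_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * margin), where the margin is
    the policy's log-prob advantage of the chosen response over the rejected
    one, measured relative to the frozen reference model."""
    margin = (pol_logp_w - ref_logp_w) - (pol_logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# At initialization the policy equals the reference, so the margin is 0 and
# the loss is log(2). beta controls how sharply the loss reacts to the margin.
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Once the policy favors the chosen response relative to the reference
# (margin +3 here), the loss drops below log(2).
improved = dpo_loss(-10.0, -16.0, -12.0, -15.0)
```

A small β such as 0.1 keeps the implicit reward soft, so the trained policy stays close to the reference model, which is the usual rationale for this hyperparameter choice.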