Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift
Authors: Seongho Son, William Bankes, Sayak Ray Chowdhury, Brooks Paige, Ilija Bogunovic
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we demonstrate the effectiveness of NS-DPO for finetuning LLMs under drifting preferences. Using scenarios where various levels of preference drift are introduced, with popular LLM reward models and datasets, we show that NS-DPO fine-tuned LLMs remain robust under non-stationarity, significantly outperforming baseline algorithms that ignore temporal preference changes, without sacrificing performance in stationary cases. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University College London, London, United Kingdom 2Department of Computer Science and Engineering, IIT Kanpur, India 3Department of Electronic and Electrical Engineering, University College London, London, United Kingdom. Correspondence to: Ilija Bogunovic <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using mathematical equations and textual explanations but does not include a distinct pseudocode or algorithm block. |
| Open Source Code | Yes | We provide code1 for our experiments. 1https://github.com/geronest/ns-dpo |
| Open Datasets | Yes | To explore the performance of NS-DPO, we construct non-stationary preference datasets from a variety of existing popular datasets; including Global Opinions QA (Durmus et al., 2024), Helpful & Harmless (Dai et al., 2023), and Ultra Feedback (Cui et al., 2023). |
| Dataset Splits | Yes | We use 10k datapoints for training and 500 datapoints for testing. (C.2.2) We use 15k points for training and 2k for testing. (C.2.4) We divide the prompt-response pairs so that training and test data do not share any prompts. (C.2.1) |
| Hardware Specification | Yes | To run the LLM experiments, we use A100 GPUs with 40GB VRAM. The synthetic experiments are run locally on a laptop without using GPUs. |
| Software Dependencies | No | The paper mentions specific language models (Llama-2-7b-chat-hf, Llama-3.2-1b-it) and an evaluation tool (AlpacaEval 2), but does not provide version numbers for underlying software dependencies such as deep learning frameworks or libraries. |
| Experiment Setup | Yes | NS-DPO uses τ = 0.1 and γ = 0.95 for fine-tuning Llama-2-7b-chat-hf with the 2C NSGO dataset and the Ultra Feedback dataset. For the Time Varying Helpful-Harmless (TV-HH) dataset, we adjust the value of γ as γ = 1 − (1/(100 − t_cp)) log(100), where t_cp is the change point. For Llama-3.2-1b-it, we use τ = 1.0 and γ = 0.85. To reduce the compute demands of fine-tuning Llama-2-7b-chat-hf, we train LoRA weights (Hu et al., 2022). |
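The hyperparameters quoted in the setup row (τ and γ) enter NS-DPO through a time-discounted variant of the DPO objective: preference pairs collected further in the past are exponentially down-weighted by γ, so the fitted policy tracks the most recent preferences. Below is a minimal plain-Python sketch of such a discounted loss, assuming the standard DPO form −log σ(τ · (policy log-ratio − reference log-ratio)); the function name, argument names, and the exact weighting `gamma ** (T - t)` are illustrative assumptions, not the authors' implementation.

```python
import math

def ns_dpo_loss(policy_logratios, ref_logratios, timestamps, T,
                tau=0.1, gamma=0.95):
    """Sketch of an exponentially time-discounted DPO loss.

    policy_logratios: log pi(y_w|x) - log pi(y_l|x) under the trained policy
    ref_logratios:    the same quantity under the frozen reference model
    timestamps:       collection time t of each preference pair (t <= T)

    Pairs with t close to the present T get weight ~1; older pairs are
    down-weighted by gamma ** (T - t), so drifting preferences are
    dominated by recent data.
    """
    total, weight_sum = 0.0, 0.0
    for p, r, t in zip(policy_logratios, ref_logratios, timestamps):
        w = gamma ** (T - t)                      # exponential discount
        logit = tau * (p - r)                     # scaled implicit reward gap
        total += w * math.log1p(math.exp(-logit)) # -log sigmoid(logit)
        weight_sum += w
    return total / weight_sum
```

With γ = 1 every pair is weighted equally and the sketch reduces to ordinary DPO; smaller γ (e.g. the quoted 0.85 for Llama-3.2-1b-it) forgets old preferences faster, which matches the report's note that γ is tuned per dataset and drift pattern.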