Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift
Authors: Seongho Son, William Bankes, Sayak Ray Chowdhury, Brooks Paige, Ilija Bogunovic
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we demonstrate the effectiveness of NS-DPO for finetuning LLMs under drifting preferences. Using scenarios where various levels of preference drift are introduced, with popular LLM reward models and datasets, we show that NS-DPO fine-tuned LLMs remain robust under non-stationarity, significantly outperforming baseline algorithms that ignore temporal preference changes, without sacrificing performance in stationary cases. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University College London, London, United Kingdom 2Department of Computer Science and Engineering, IIT Kanpur, India 3Department of Electronic and Electrical Engineering, University College London, London, United Kingdom. Correspondence to: Ilija Bogunovic <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using mathematical equations and textual explanations but does not include a distinct pseudocode or algorithm block. |
| Open Source Code | Yes | We provide code1 for our experiments. 1https://github.com/geronest/ns-dpo |
| Open Datasets | Yes | To explore the performance of NS-DPO, we construct non-stationary preference datasets from a variety of existing popular datasets; including Global Opinions QA (Durmus et al., 2024), Helpful & Harmless (Dai et al., 2023), and Ultra Feedback (Cui et al., 2023). |
| Dataset Splits | Yes | We use 10k datapoints for training and 500 datapoints for testing. (C.2.2) We use 15k points for training and 2k for testing. (C.2.4) We divide the prompt-response pairs so that training and test data do not share any prompts. (C.2.1) |
| Hardware Specification | Yes | To run the LLM experiments, we use A100 GPUs with 40GB VRAM. The synthetic experiments are run locally on a laptop without using GPUs. |
| Software Dependencies | No | The paper mentions specific language models (Llama-2-7b-chat-hf, Llama-3.2-1b-it) and an evaluation tool (AlpacaEval 2), but does not provide version numbers for underlying software dependencies such as deep learning frameworks or libraries. |
| Experiment Setup | Yes | NS-DPO uses τ = 0.1 and γ = 0.95 for fine-tuning Llama-2-7b-chat-hf with the 2C NSGO dataset and the Ultra Feedback dataset. For the Time Varying Helpful-Harmless (TV-HH) dataset, we adjust the value of γ as γ = 1 − (1/(100 − t_cp)) log(100), where t_cp is the change point. For Llama-3.2-1b-it, we use τ = 1.0 and γ = 0.85. To reduce the compute demands of fine-tuning Llama-2-7b-chat-hf, we train LoRA weights (Hu et al., 2022). |
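The hyperparameters quoted in the setup row (τ and γ) enter NS-DPO through a time-discounted variant of the DPO objective: preference pairs collected further in the past are exponentially down-weighted by γ, so the fitted policy tracks the most recent preferences. Below is a minimal plain-Python sketch of such a discounted loss, assuming the standard DPO form −log σ(τ · (policy log-ratio − reference log-ratio)); the function name, argument names, and the exact weighting `gamma ** (T - t)` are illustrative assumptions, not the authors' implementation.

```python
import math

def ns_dpo_loss(policy_logratios, ref_logratios, timestamps, T,
                tau=0.1, gamma=0.95):
    """Sketch of an exponentially time-discounted DPO loss.

    policy_logratios: log pi(y_w|x) - log pi(y_l|x) under the trained policy
    ref_logratios:    the same quantity under the frozen reference model
    timestamps:       collection time t of each preference pair (t <= T)

    Pairs with t close to the present T get weight ~1; older pairs are
    down-weighted by gamma ** (T - t), so drifting preferences are
    dominated by recent data.
    """
    total, weight_sum = 0.0, 0.0
    for p, r, t in zip(policy_logratios, ref_logratios, timestamps):
        w = gamma ** (T - t)                      # exponential discount
        logit = tau * (p - r)                     # scaled implicit reward gap
        total += w * math.log1p(math.exp(-logit)) # -log sigmoid(logit)
        weight_sum += w
    return total / weight_sum
```

With γ = 1 every pair is weighted equally and the sketch reduces to ordinary DPO; smaller γ (e.g. the quoted 0.85 for Llama-3.2-1b-it) forgets old preferences faster, which matches the report's note that γ is tuned per dataset and drift pattern.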