SquareχPO: Differentially Private and Robust χ²-Preference Optimization in Offline Direct Alignment

Authors: Xingyu Zhou, Yulian Wu, Wenqian Weng, Francesco Orabona

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduced SquareχPO, a novel offline alignment method that achieves state-of-the-art theoretical guarantees in the presence of noisy labels caused by privacy protections and/or adversarial corruption. Our algorithm can handle both BT-preference and general preference models. While our primary focus is theoretical, SquareχPO remains practical and easy to implement, requiring only a minor modification to χPO and DPO. Future work will focus on comprehensive empirical evaluations to further validate our findings. ... E. Experiments. Dataset. We utilize GPT-4o to generate a synthetic dataset, referred to as finance preference, which comprises 1697 preference samples. ... Evaluation. We evaluate our trained models by generating responses for the test dataset. ... We compute the average and standard deviation across 5 random seeds.
Researcher Affiliation | Academia | 1 Wayne State University, USA; 2 King Abdullah University of Science and Technology, Saudi Arabia. Correspondence to: Xingyu Zhou <EMAIL>.
Pseudocode | Yes | Algorithm 1: SquareχPO for CTL and LTC; Algorithm 2: SquareχPO for c DP; Algorithm 3: Iterative SquareχPO under Corruption and Privacy Protection.
Open Source Code | No | Future work will focus on comprehensive empirical evaluations to further validate our findings.
Open Datasets | No | We utilize GPT-4o to generate a synthetic dataset, referred to as finance preference, which comprises 1697 preference samples. Each sample includes a prompt related to a financial scenario and two possible responses, where rejected represents the high-risk option and chosen represents the low-risk option.
Dataset Splits | Yes | For alignment training, we split the dataset into 85% for training, 5% for validation, and 10% for testing.
Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments.
Software Dependencies | No | We begin by fine-tuning GPT2-large using the finance sft dataset to obtain the SFT policy, π_sft. For this, we directly utilize the SFT trainer from the Transformer Reinforcement Learning (TRL) library (von Werra et al., 2020).
Experiment Setup | Yes | We have compared the performance of χPO and SquareχPO under CTL and LTC settings with ε = 0.5 and α = 0.1. ... We compute the average and standard deviation across 5 random seeds. ... Suppose Algorithm 3 is invoked with β = 1/T and η = 1/T.
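The paper does not release code, but two of the reproducible details above (the 85/5/10 dataset split and the mean/standard-deviation aggregation over 5 random seeds) are mechanical enough to sketch. The following is a minimal, hypothetical Python illustration, not the authors' implementation; the function names `split_dataset` and `aggregate_over_seeds` and the fixed shuffling seed are assumptions introduced here for clarity.

```python
import random
import statistics

def split_dataset(samples, train_frac=0.85, val_frac=0.05, seed=0):
    """Shuffle and split samples into train/validation/test partitions.

    Defaults mirror the paper's reported 85% / 5% / 10% split; the
    shuffling seed is an assumption (the paper does not specify one).
    """
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def aggregate_over_seeds(scores):
    """Mean and sample standard deviation of per-seed evaluation scores,
    as in 'average and standard deviation across 5 random seeds'."""
    return statistics.mean(scores), statistics.stdev(scores)

# 1697 preference samples, matching the size of the synthetic
# finance-preference dataset described in the paper.
train, val, test = split_dataset(list(range(1697)))
print(len(train), len(val), len(test))  # → 1442 84 171

# Placeholder per-seed scores (illustrative numbers only).
mean, std = aggregate_over_seeds([0.71, 0.68, 0.73, 0.70, 0.69])
```

Note that integer truncation makes the test partition absorb the rounding remainder (171 of 1697 samples ≈ 10.1%), which is one common convention; the paper does not say how it rounds.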