Square$\chi$PO: Differentially Private and Robust $\chi^2$-Preference Optimization in Offline Direct Alignment
Authors: Xingyu Zhou, Yulian Wu, Wenqian Weng, Francesco Orabona
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduced SquareχPO, a novel offline alignment method that achieves state-of-the-art theoretical guarantees in the presence of noisy labels caused by privacy protections and/or adversarial corruption. Our algorithm can handle both BT-preference and general preference models. While our primary focus is theoretical, SquareχPO remains practical and easy to implement, requiring only a minor modification to χPO and DPO. Future work will focus on comprehensive empirical evaluations to further validate our findings. ... E. Experiments Dataset. We utilize GPT-4o to generate a synthetic dataset, referred to as finance preference, which comprises 1697 preference samples. ... Evaluation. We evaluate our trained models by generating responses for the test dataset. ... We compute the average and standard deviation across 5 random seeds. |
| Researcher Affiliation | Academia | 1Wayne State University, USA 2King Abdullah University of Science and Technology, Saudi Arabia. Correspondence to: Xingyu Zhou <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 SquareχPO for CTL and LTC, Algorithm 2 SquareχPO for cDP, Algorithm 3 Iterative SquareχPO under Corruption and Privacy Protection |
| Open Source Code | No | Future work will focus on comprehensive empirical evaluations to further validate our findings. |
| Open Datasets | No | We utilize GPT-4o to generate a synthetic dataset, referred to as finance preference, which comprises 1697 preference samples. Each sample includes a prompt related to a financial scenario and two possible responses, where rejected represents the high-risk option and chosen represents the low-risk option. |
| Dataset Splits | Yes | For alignment training, we split the dataset into 85% for training, 5% for validation, and 10% for testing. |
| Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments. |
| Software Dependencies | No | We begin by fine-tuning GPT2-large using the finance sft dataset to obtain the SFT policy, πsft. For this, we directly utilize the SFT trainer from the Transformer Reinforcement Learning (TRL) library (von Werra et al., 2020). |
| Experiment Setup | Yes | We have compared the performance of χPO and SquareχPO under CTL and LTC settings with ε = 0.5 and α = 0.1. ... We compute the average and standard deviation across 5 random seeds. ... Suppose Algorithm 3 is invoked with β = 1/T and η = 1/T |
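The 85% / 5% / 10% train/validation/test split quoted above can be sketched as follows. This is a minimal illustration, not the authors' code: the paper states only the ratios and the dataset size (1697 preference samples), so the shuffling procedure, seed, and the `split_dataset` helper are assumptions.

```python
# Hypothetical sketch of the 85% / 5% / 10% split reported in the paper.
# Only the ratios and dataset size (1697) come from the source text.
import random

def split_dataset(samples, train_frac=0.85, val_frac=0.05, seed=0):
    """Shuffle and split a list of preference samples into train/val/test."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(train_frac * n)
    n_val = int(val_frac * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remaining ~10%
    return train, val, test

train, val, test = split_dataset(list(range(1697)))
print(len(train), len(val), len(test))  # 1442 84 171
```

With 1697 samples this yields 1442 training, 84 validation, and 171 test examples; rounding of the fractional counts is resolved by giving the remainder to the test split.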