Square$\chi$PO: Differentially Private and Robust $\chi^2$-Preference Optimization in Offline Direct Alignment
Authors: Xingyu Zhou, Yulian Wu, Wenqian Weng, Francesco Orabona
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduced SquareχPO, a novel offline alignment method that achieves state-of-the-art theoretical guarantees in the presence of noisy labels caused by privacy protections and/or adversarial corruption. Our algorithm can handle both BT-preference and general preference models. While our primary focus is theoretical, SquareχPO remains practical and easy to implement, requiring only a minor modification to χPO and DPO. Future work will focus on comprehensive empirical evaluations to further validate our findings. ... E. Experiments Dataset. We utilize GPT-4o to generate a synthetic dataset, referred to as finance preference, which comprises 1697 preference samples. ... Evaluation. We evaluate our trained models by generating responses for the test dataset. ... We compute the average and standard deviation across 5 random seeds. |
| Researcher Affiliation | Academia | 1Wayne State University, USA 2King Abdullah University of Science and Technology, Saudi Arabia. Correspondence to: Xingyu Zhou <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 SquareχPO for CTL and LTC, Algorithm 2 SquareχPO for cDP, Algorithm 3 Iterative SquareχPO under Corruption and Privacy Protection |
| Open Source Code | No | Future work will focus on comprehensive empirical evaluations to further validate our findings. |
| Open Datasets | No | We utilize GPT-4o to generate a synthetic dataset, referred to as finance preference, which comprises 1697 preference samples. Each sample includes a prompt related to a financial scenario and two possible responses, where rejected represents the high-risk option and chosen represents the low-risk option. |
| Dataset Splits | Yes | For alignment training, we split the dataset into 85% for training, 5% for validation, and 10% for testing. |
| Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments. |
| Software Dependencies | No | We begin by fine-tuning GPT2-large using the finance sft dataset to obtain the SFT policy, πsft. For this, we directly utilize the SFT trainer from the Transformer Reinforcement Learning (TRL) library (von Werra et al., 2020). |
| Experiment Setup | Yes | We have compared the performance of χPO and SquareχPO under CTL and LTC settings with ε = 0.5 and α = 0.1. ... We compute the average and standard deviation across 5 random seeds. ... Suppose Algorithm 3 is invoked with β = 1/T and η = 1/T |
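The 85% / 5% / 10% train/validation/test split quoted above can be sketched as follows. This is a minimal illustration, not the authors' code: the paper states only the ratios and the dataset size (1697 preference samples), so the shuffling procedure, seed, and the `split_dataset` helper are assumptions.

```python
# Hypothetical sketch of the 85% / 5% / 10% split reported in the paper.
# Only the ratios and dataset size (1697) come from the source text.
import random

def split_dataset(samples, train_frac=0.85, val_frac=0.05, seed=0):
    """Shuffle and split a list of preference samples into train/val/test."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(train_frac * n)
    n_val = int(val_frac * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remaining ~10%
    return train, val, test

train, val, test = split_dataset(list(range(1697)))
print(len(train), len(val), len(test))  # 1442 84 171
```

With 1697 samples this yields 1442 training, 84 validation, and 171 test examples; rounding of the fractional counts is resolved by giving the remainder to the test split.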