PIPA: Preference Alignment as Prior-Informed Statistical Estimation
Authors: Junbo Li, Zhangyang Wang, Qiang Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Both algorithms demonstrate a 3–10% performance enhancement on the GSM8K and MATH benchmarks across all configurations, achieving these gains without additional training or computational costs compared to existing algorithms. |
| Researcher Affiliation | Academia | The University of Texas at Austin, US. Correspondence to: Qiang Liu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 PIPA: Prior-Informed Preference Alignment |
| Open Source Code | No | The paper states: "All experiments are conducted based on OpenRLHF (Hu et al., 2024)." This indicates the use of a third-party framework, not the release of the authors' own implementation code for PIPA. No specific link or statement for their code is provided. |
| Open Datasets | Yes | We use the unpaired preference dataset for math reasoning released by AlphaMath (Chen et al., 2024a), which includes training problems from GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) with both CoT (Wei et al., 2022) and TIR (Gou et al., 2023)-style solutions, along with step-level label annotations. ... For our evaluation, we use the standard GSM8K and MATH benchmarks. |
| Dataset Splits | Yes | We use the unpaired preference dataset for math reasoning released by AlphaMath (Chen et al., 2024a), which includes training problems from GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021)... For our evaluation, we use the standard GSM8K and MATH benchmarks. |
| Hardware Specification | No | This research has been supported by computing support on the Vista GPU Cluster through the Center for Generative AI (CGAI) and the Texas Advanced Computing Center (TACC) at the University of Texas at Austin. |
| Software Dependencies | No | The paper mentions: "All experiments are conducted based on OpenRLHF (Hu et al., 2024)." and "For all training, we use LoRA (Hu et al., 2021)". However, no specific version numbers are provided for these or any other software components. |
| Experiment Setup | Yes | For all training, we use LoRA (Hu et al., 2021) with rank 64 and α = 16. All alignment algorithms are conducted for 1 epoch after the SFT stage. Denote bs to be the batch size and lr to be the learning rate. We do grid search for lr ∈ {5×10⁻⁷, 5×10⁻⁶, 5×10⁻⁵} for all experiments and present the best one. SFT: Before all alignment algorithms, we first finetune the pre-trained Deepseek and Qwen models on the positive samples for 3 epochs with bs = 1024 and lr = 4×10⁻⁵. ... DPO: For DPO-based algorithms including DPO, IPO, Step-DPO, we train 1 epoch after the SFT stage, with bs = 256, lr = 5×10⁻⁷ and β = 0.1. ... KTO: For KTO, we set lr = 5×10⁻⁵ for the Deepseek model and lr = 5×10⁻⁷ for the Qwen model. For both, bs = 256, β = 0.1. Step-KTO shares exactly the same recipe with KTO. PIPA: We set bs = 256 for all four settings, lr = 5×10⁻⁵ for Deepseek and 5×10⁻⁷ for Qwen. All settings are the same as KTO and Step-KTO, without additional hyperparameters to be tuned. |
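The quoted recipe can be summarized as a small lookup table. This is an illustrative sketch only: the dictionary names and structure are our own, and only the numeric values come from the paper's reported setup.

```python
# Illustrative summary of the quoted training recipes (structure is ours;
# hyperparameter values are taken from the paper's reported setup).

LORA = {"rank": 64, "alpha": 16}           # used for all training runs

LR_GRID = [5e-7, 5e-6, 5e-5]               # grid-searched; best result reported

SFT = {"epochs": 3, "batch_size": 1024, "lr": 4e-5}

ALIGNMENT = {
    # DPO recipe is shared by IPO and Step-DPO.
    "DPO":  {"batch_size": 256, "lr": 5e-7, "beta": 0.1},
    # KTO recipe is shared by Step-KTO; lr differs per base model.
    "KTO":  {"batch_size": 256, "beta": 0.1,
             "lr": {"deepseek": 5e-5, "qwen": 5e-7}},
    # PIPA reuses the KTO settings with no additional hyperparameters.
    "PIPA": {"batch_size": 256,
             "lr": {"deepseek": 5e-5, "qwen": 5e-7}},
}
```

Note that PIPA's row introduces no hyperparameters beyond those already tuned for KTO, which is the basis for the paper's "no additional tuning" claim.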