PIPA: Preference Alignment as Prior-Informed Statistical Estimation
Authors: Junbo Li, Zhangyang Wang, Qiang Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Both algorithms demonstrate a 3–10% performance enhancement on the GSM8K and MATH benchmarks across all configurations, achieving these gains without additional training or computational costs compared to existing algorithms. |
| Researcher Affiliation | Academia | The University of Texas at Austin, US. Correspondence to: Qiang Liu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 PIPA: Prior-Informed Preference Alignment |
| Open Source Code | No | The paper states: "All experiments are conducted based on OpenRLHF (Hu et al., 2024)." This indicates the use of a third-party framework, not the release of the authors' own implementation code for PIPA. No specific link or statement for their code is provided. |
| Open Datasets | Yes | We use the unpaired preference dataset for math reasoning released by AlphaMath (Chen et al., 2024a), which includes training problems from GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021) with both CoT (Wei et al., 2022) and TIR (Gou et al., 2023)-style solutions, along with step-level label annotations. ... For our evaluation, we use the standard GSM8K and MATH benchmarks. |
| Dataset Splits | Yes | We use the unpaired preference dataset for math reasoning released by AlphaMath (Chen et al., 2024a), which includes training problems from GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021)... For our evaluation, we use the standard GSM8K and MATH benchmarks. |
| Hardware Specification | No | This research has been supported by computing support on the Vista GPU Cluster through the Center for Generative AI (CGAI) and the Texas Advanced Computing Center (TACC) at the University of Texas at Austin. |
| Software Dependencies | No | The paper mentions: "All experiments are conducted based on OpenRLHF (Hu et al., 2024)." and "For all training, we use LoRA (Hu et al., 2021)". However, no specific version numbers are provided for these or any other software components. |
| Experiment Setup | Yes | For all training, we use LoRA (Hu et al., 2021) with rank 64 and α = 16. All alignment algorithms are conducted for 1 epoch after the SFT stage. Denote bs to be the batch size and lr to be the learning rate. We do grid search for lr ∈ {5×10⁻⁷, 5×10⁻⁶, 5×10⁻⁵} for all experiments and present the best one. SFT: Before all alignment algorithms, we first finetune the pre-trained Deepseek and Qwen models on the positive samples for 3 epochs with bs = 1024 and lr = 4×10⁻⁵. ... DPO: For DPO-based algorithms including DPO, IPO, Step-DPO, we train 1 epoch after the SFT stage, with bs = 256, lr = 5×10⁻⁷ and β = 0.1. ... KTO: For KTO, we set lr = 5×10⁻⁵ for the Deepseek model and lr = 5×10⁻⁷ for the Qwen model. For both, bs = 256, β = 0.1. Step-KTO shares exactly the same recipe with KTO. PIPA: We set bs = 256 for all four settings, lr = 5×10⁻⁵ for Deepseek and 5×10⁻⁷ for Qwen. All settings are the same as KTO and Step-KTO, without additional hyperparameters to be tuned. |
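The quoted recipe can be summarized as a small lookup table. This is an illustrative sketch only: the dictionary names and structure are our own, and only the numeric values come from the paper's reported setup.

```python
# Illustrative summary of the quoted training recipes (structure is ours;
# hyperparameter values are taken from the paper's reported setup).

LORA = {"rank": 64, "alpha": 16}           # used for all training runs

LR_GRID = [5e-7, 5e-6, 5e-5]               # grid-searched; best result reported

SFT = {"epochs": 3, "batch_size": 1024, "lr": 4e-5}

ALIGNMENT = {
    # DPO recipe is shared by IPO and Step-DPO.
    "DPO":  {"batch_size": 256, "lr": 5e-7, "beta": 0.1},
    # KTO recipe is shared by Step-KTO; lr differs per base model.
    "KTO":  {"batch_size": 256, "beta": 0.1,
             "lr": {"deepseek": 5e-5, "qwen": 5e-7}},
    # PIPA reuses the KTO settings with no additional hyperparameters.
    "PIPA": {"batch_size": 256,
             "lr": {"deepseek": 5e-5, "qwen": 5e-7}},
}
```

Note that PIPA's row introduces no hyperparameters beyond those already tuned for KTO, which is the basis for the paper's "no additional tuning" claim.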