Unified Preference Optimization: Language Model Alignment Beyond the Preference Frontier

Authors: Anirudhan Badrinath, Prabhat Agarwal, Jiajing Xu

TMLR 2025

Reproducibility assessment (variable, result, and supporting excerpt):
Research Type: Experimental
    "In this section, we evaluate the proposed method, UPO, and compare it with prior methods. Given socially relevant auxiliary objectives and a set of generic datasets that do not overfit or specifically cater to our chosen objectives, we evaluate the proficiency of alignment methods to produce generations aligned with user and designer preferences. Compared to UPO, we show that neither purely RL nor DPO-based approaches can achieve comparable performance in multi-objective optimization with sufficient efficiency and stability."
Researcher Affiliation: Industry
    "Anirudhan Badrinath, Prabhat Agarwal, Jiajing Xu EMAIL"
Pseudocode: Yes
    "Algorithm 1: Training algorithm for UPO given LM πϕ, reference LM πref, and dataset D."
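Algorithm 1 itself is not reproduced in this report. As a rough illustration of the preference-optimization family that UPO builds on, the sketch below implements a generic per-example DPO-style loss; it is not the authors' UPO objective, and the function name and β parameter are illustrative assumptions:

```python
import math

def dpo_style_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO-style preference loss (illustrative, not UPO itself).

    Computes -log(sigmoid(beta * margin)), where the margin is the
    policy's log-probability advantage of the chosen response over the
    rejected one, measured relative to the reference model.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# With a zero margin the loss reduces to -log(0.5) ≈ 0.693.
print(round(dpo_style_loss(-10.0, -10.0, -10.0, -10.0), 3))
```

Minimizing this loss pushes the policy to place more probability mass on the chosen response than the reference model does, relative to the rejected one; UPO additionally folds in RL-style auxiliary objectives, which this sketch omits.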
Open Source Code: No
    The paper neither links to a source-code repository nor states that the code will be made publicly available; it includes only an illustrative code snippet.
Open Datasets: Yes
    "Similarly to Ethayarajh et al. (2024), the models are trained on a combination of Anthropic HH (Ganguli et al., 2022), Open Assistant (Köpf et al., 2024) and SHP (Ethayarajh et al., 2022)."
Dataset Splits: No
    "For evaluation, we use 512 prompts sampled from all datasets." The paper describes training on a combination of datasets but provides no train/validation/test splits, percentages, or sample counts, which full reproducibility would require.
Hardware Specification: Yes
    "For compute resources, we use a combination of 8 40GB A100 GPUs and 8 80GB A100 GPUs alongside 96 CPUs and 1 TB of RAM."
Software Dependencies: No
    The paper mentions "Optimizer RMSprop" but lists no software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup: Yes
    "Table 6: Hyperparameters for training (shared with all models)": learning rate 5e-7, number of epochs 1, optimizer RMSprop, warmup steps 150, number of evaluation data 512, gradient clipping 10. "For UPO, we use a weight of 0.5 and a temperature term of 0.5 (α = 0.5)."
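For reference, the reported hyperparameters can be collected into a single configuration. The dict layout and key names below are illustrative assumptions; the values are exactly those given in Table 6 and the UPO-specific settings:

```python
# Training hyperparameters from Table 6 (shared across all models),
# plus the UPO-specific weight and temperature alpha.
# Key names are illustrative; values are as reported in the paper.
TRAIN_CONFIG = {
    "learning_rate": 5e-7,
    "n_epochs": 1,
    "optimizer": "RMSprop",      # library/version not specified in the paper
    "warmup_steps": 150,
    "num_eval_data": 512,
    "gradient_clipping": 10.0,
    "upo": {"weight": 0.5, "alpha": 0.5},  # temperature term alpha = 0.5
}
print(TRAIN_CONFIG["optimizer"])
```

Note that the paper gives no optimizer implementation or version, so wiring this config into an actual training loop would still require choices the paper does not pin down.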