Unified Preference Optimization: Language Model Alignment Beyond the Preference Frontier

Authors: Anirudhan Badrinath, Prabhat Agarwal, Jiajing Xu

TMLR 2025

Reproducibility assessment (variable, result, and supporting excerpt):
Research Type: Experimental
    "In this section, we evaluate the proposed method, UPO, and compare it with prior methods. Given socially relevant auxiliary objectives and a set of generic datasets that do not overfit or specifically cater to our chosen objectives, we evaluate the proficiency of alignment methods to produce generations aligned with user and designer preferences. Compared to UPO, we show that neither purely RL nor DPO-based approaches can achieve comparable performance in multi-objective optimization with sufficient efficiency and stability."
Researcher Affiliation: Industry
    "Anirudhan Badrinath, Prabhat Agarwal, Jiajing Xu EMAIL"
Pseudocode: Yes
    "Algorithm 1: Training algorithm for UPO given LM πϕ, reference LM πref, and dataset D."
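Algorithm 1 itself is not reproduced in this report. As a rough illustration of the preference-optimization family that UPO builds on, the sketch below implements a generic per-example DPO-style loss; it is not the authors' UPO objective, and the function name and β parameter are illustrative assumptions:

```python
import math

def dpo_style_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO-style preference loss (illustrative, not UPO itself).

    Computes -log(sigmoid(beta * margin)), where the margin is the
    policy's log-probability advantage of the chosen response over the
    rejected one, measured relative to the reference model.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# With a zero margin the loss reduces to -log(0.5) ≈ 0.693.
print(round(dpo_style_loss(-10.0, -10.0, -10.0, -10.0), 3))
```

Minimizing this loss pushes the policy to place more probability mass on the chosen response than the reference model does, relative to the rejected one; UPO additionally folds in RL-style auxiliary objectives, which this sketch omits.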
Open Source Code: No
    The paper neither links to a source-code repository nor states that the code will be made publicly available; it includes only an illustrative code snippet.
Open Datasets: Yes
    "Similarly to Ethayarajh et al. (2024), the models are trained on a combination of Anthropic HH (Ganguli et al., 2022), Open Assistant (Köpf et al., 2024) and SHP (Ethayarajh et al., 2022)."
Dataset Splits: No
    "For evaluation, we use 512 prompts sampled from all datasets." The paper describes training on a combination of datasets but provides no train/validation/test splits, percentages, or sample counts, which full reproducibility would require.
Hardware Specification: Yes
    "For compute resources, we use a combination of 8 40GB A100 GPUs and 8 80GB A100 GPUs alongside 96 CPUs and 1 TB of RAM."
Software Dependencies: No
    The paper mentions "Optimizer RMSprop" but lists no software dependencies with version numbers (e.g., Python, PyTorch, or CUDA versions).
Experiment Setup: Yes
    "Table 6: Hyperparameters for training (shared with all models)": learning rate 5e-7, number of epochs 1, optimizer RMSprop, warmup steps 150, number of evaluation data 512, gradient clipping 10. "For UPO, we use a weight of 0.5 and a temperature term of 0.5 (α = 0.5)."
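For reference, the reported hyperparameters can be collected into a single configuration. The dict layout and key names below are illustrative assumptions; the values are exactly those given in Table 6 and the UPO-specific settings:

```python
# Training hyperparameters from Table 6 (shared across all models),
# plus the UPO-specific weight and temperature alpha.
# Key names are illustrative; values are as reported in the paper.
TRAIN_CONFIG = {
    "learning_rate": 5e-7,
    "n_epochs": 1,
    "optimizer": "RMSprop",      # library/version not specified in the paper
    "warmup_steps": 150,
    "num_eval_data": 512,
    "gradient_clipping": 10.0,
    "upo": {"weight": 0.5, "alpha": 0.5},  # temperature term alpha = 0.5
}
print(TRAIN_CONFIG["optimizer"])
```

Note that the paper gives no optimizer implementation or version, so wiring this config into an actual training loop would still require choices the paper does not pin down.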