Unified Preference Optimization: Language Model Alignment Beyond the Preference Frontier
Authors: Anirudhan Badrinath, Prabhat Agarwal, Jiajing Xu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate the proposed method, UPO, and compare it with prior methods. Given socially relevant auxiliary objectives and a set of generic datasets that do not overfit or specifically cater to our chosen objectives, we evaluate the proficiency of alignment methods to produce generations aligned with user and designer preferences. Compared to UPO, we show that neither purely RL nor DPO-based approaches can achieve comparable performance in multi-objective optimization with sufficient efficiency and stability. |
| Researcher Affiliation | Industry | Anirudhan Badrinath, Prabhat Agarwal, Jiajing Xu EMAIL |
| Pseudocode | Yes | Algorithm 1 Training algorithm for UPO given LM πϕ, reference LM πref and dataset D. |
| Open Source Code | No | The paper does not provide a direct link to a source code repository or an explicit statement that the code will be made publicly available. It only shows a code snippet for illustration purposes. |
| Open Datasets | Yes | Similarly to Ethayarajh et al. (2024), the models are trained on a combination of Anthropic HH (Ganguli et al., 2022), Open Assistant (Köpf et al., 2024) and SHP (Ethayarajh et al., 2022). |
| Dataset Splits | No | For evaluation, we use 512 prompts sampled from all datasets. The paper mentions using a combination of datasets for training, but does not provide specific train/validation/test splits, percentages, or sample counts for the training data needed for full reproducibility. |
| Hardware Specification | Yes | For compute resources, we use a combination of 8 40GB A100 GPUs and 8 80GB A100 GPUs alongside 96 CPUs and 1 TB of RAM. |
| Software Dependencies | No | The paper mentions 'Optimizer RMSprop' but does not specify its version or any other software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Table 6: Hyperparameters for training (shared with all models). Learning Rate (lr) 5e-7, Number of Epochs (n_epochs) 1, Optimizer RMSprop, Warmup Steps 150, Number of Evaluation Data (num_eval_data) 512, Gradient Clipping 10. For UPO, we use a weight of 0.5 and a temperature term of 0.5 (α = 0.5). |
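Since the paper reports its hyperparameters (Table 6) but does not release code, the shared training settings can be collected into a single configuration sketch. This is a hypothetical reconstruction: the key names below are our own, and only the values come from the paper; the optimizer version and framework remain unspecified.

```python
# Hypothetical reconstruction of the shared training configuration
# reported in Table 6; key names are illustrative, values are from the paper.
TRAIN_CONFIG = {
    "learning_rate": 5e-7,     # Learning Rate (lr)
    "n_epochs": 1,             # Number of Epochs
    "optimizer": "RMSprop",    # version not specified in the paper
    "warmup_steps": 150,
    "num_eval_data": 512,      # evaluation prompts sampled from all datasets
    "gradient_clipping": 10.0,
}

# UPO-specific settings reported alongside Table 6.
UPO_CONFIG = {
    "weight": 0.5,  # weight on the auxiliary objective
    "alpha": 0.5,   # temperature term (α = 0.5)
}

if __name__ == "__main__":
    for key, value in {**TRAIN_CONFIG, **UPO_CONFIG}.items():
        print(f"{key}: {value}")
```

A reproduction attempt would still need the unstated details flagged in the table above (software versions, exact dataset splits) before these values alone could recover the reported results.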