Projection Optimization: A General Framework for Multi-Objective and Multi-Group RLHF

Authors: Nuoya Xiong, Aarti Singh

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, leveraging our theoretical insights, we propose a nearly training-free algorithm once the optimal policies for individual objectives are obtained. [...] 6. Experiments We fine-tune a LLAMA2-7B model using Anthropic-HH dataset (Bai et al., 2022) with three different objectives of an LM assistant: Humor, Helpful, and Harmless. [...] The experimental results show that MOPO performs generally better.
Researcher Affiliation | Academia | Carnegie Mellon University, PA, USA. Correspondence to: Aarti Singh <EMAIL>.
Pseudocode | Yes | Algorithm 1 MOP-Reward Based (RB) [...] Algorithm 2 MOP-Reward Free (RF) (Online Version) [...] Algorithm 3 MOPO-Offline [...] Algorithm 4 VPO-objective-learning-general [...] Algorithm 5 MOPO (Practical Version)-Offline
Open Source Code | No | The paper relies on several existing reward models available on Hugging Face (e.g., https://huggingface.co/Ray2333/gpt2-large-harmless-reward_model; a hedged loading sketch appears after this table). However, there is no explicit statement or link indicating that the authors have publicly released their own implementation of the proposed methodology.
Open Datasets | Yes | We fine-tune a LLAMA2-7B model using Anthropic-HH dataset (Bai et al., 2022)
Dataset Splits | No | The paper mentions using the "Anthropic-HH dataset" but does not provide specific details regarding its training, validation, or test splits. It does not mention percentages, counts, or a predefined split strategy.
Hardware Specification | No | The paper mentions fine-tuning a "LLAMA2-7B model" and running experiments but does not specify any hardware details such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions fine-tuning a "LLAMA2-7B model" and using "MOD (Shi et al., 2024)" and a "PPO approach" but does not provide version numbers for any software libraries, frameworks, or programming languages used in the implementation.
Experiment Setup | Yes | In our experiments, we set the number of iterations to 7, striking a balance between computational efficiency and performance. To compute the expected reward vector V^t, we take the expectation over 100 training samples. (A hedged estimation sketch follows the table.)
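
The Open Source Code row cites off-the-shelf reward models hosted on Hugging Face rather than author-released code. The snippet below is a minimal sketch, not the authors' implementation, of how one of those cited checkpoints could be loaded and used to score a single response; it assumes the checkpoint loads as a sequence-classification model with a single scalar logit, and the prompt/response text is invented for illustration.

```python
# Hedged sketch: scoring one response with an off-the-shelf harmlessness
# reward model cited in the paper. Assumes the checkpoint exposes a
# sequence-classification head with a single scalar logit; the prompt and
# response strings are invented for illustration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "Ray2333/gpt2-large-harmless-reward_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name)
reward_model.eval()

prompt = "Human: How can I apologize to a friend I hurt? Assistant:"
response = " Acknowledge what happened, say you are sorry, and ask how to make it right."

inputs = tokenizer(prompt + response, return_tensors="pt", truncation=True)
with torch.no_grad():
    reward = reward_model(**inputs).logits[0, 0].item()  # scalar reward score
print(f"harmlessness reward: {reward:.3f}")
```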
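
The Experiment Setup row reports that the expected reward vector V^t is estimated over 100 training samples at each of 7 iterations. The sketch below gives one plausible Monte Carlo reading of that estimation step; the helper names (generate, reward_fns) and the sampling details are assumptions for illustration, not the paper's code.

```python
# Hedged sketch: estimating an expected reward vector V^t by averaging each
# objective's reward over a batch of sampled training prompts. `generate`
# (the current policy) and `reward_fns` (one scorer per objective, e.g.
# humor / helpful / harmless) are hypothetical placeholders.
from typing import Callable, Sequence
import random

def estimate_reward_vector(
    prompts: Sequence[str],
    generate: Callable[[str], str],
    reward_fns: Sequence[Callable[[str, str], float]],
    num_samples: int = 100,
) -> list[float]:
    """Monte Carlo estimate of the expected reward for each objective."""
    batch = random.sample(list(prompts), k=min(num_samples, len(prompts)))
    totals = [0.0] * len(reward_fns)
    for prompt in batch:
        response = generate(prompt)  # sample a response from the current policy
        for j, reward_fn in enumerate(reward_fns):
            totals[j] += reward_fn(prompt, response)
    return [t / len(batch) for t in totals]  # one entry of V^t per objective
```

Under the reported setup, such an estimate would be recomputed at each of the 7 iterations; how it is then used to update the combined policy is specified by the MOPO algorithms listed in the Pseudocode row.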