Projection Optimization: A General Framework for Multi-Objective and Multi-Group RLHF
Authors: Nuoya Xiong, Aarti Singh
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, leveraging our theoretical insights, we propose a nearly training-free algorithm once the optimal policies for individual objectives are obtained. [...] 6. Experiments We fine-tune a LLAMA2-7B model using Anthropic-HH dataset (Bai et al., 2022) with three different objectives of an LM assistant: Humor, Helpful, and Harmless. [...] The experimental results show that MOPO performs generally better. |
| Researcher Affiliation | Academia | 1Carnegie Mellon University, PA, USA. Correspondence to: Aarti Singh <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 MOP-Reward Based (RB) [...] Algorithm 2 MOP-Reward Free (RF) (Online Version) [...] Algorithm 3 MOPO-Offline [...] Algorithm 4 VPO-objective-learning-general [...] Algorithm 5 MOPO(Practical Version)-Offline |
| Open Source Code | No | The paper discusses using several existing reward models available on Hugging Face (e.g., "https://huggingface.co/Ray2333/gpt2-large-harmless-reward_model"). However, there is no explicit statement or link indicating that the authors have made their own implementation code for the proposed methodology publicly available. |
| Open Datasets | Yes | We fine-tune a LLAMA2-7B model using Anthropic-HH dataset (Bai et al., 2022) |
| Dataset Splits | No | The paper mentions using the "Anthropic-HH dataset" but does not provide specific details regarding its training, validation, or test splits. It does not mention percentages, counts, or predefined split strategies. |
| Hardware Specification | No | The paper mentions fine-tuning a "LLAMA2-7B model" and running experiments but does not specify any hardware details such as GPU models, CPU types, or memory used for these experiments. |
| Software Dependencies | No | The paper mentions fine-tuning a "LLAMA2-7B model" and using "MOD (Shi et al., 2024)" and a "PPO approach" but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in their implementation. |
| Experiment Setup | Yes | In our experiments, we set the number of iterations to 7, striking a balance between computational efficiency and performance. To compute the expected reward vector V_t, we calculate the expectation by taking the expectation over 100 training samples |
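The setup row above reports estimating the expected reward vector V_t as an empirical mean over 100 training samples across the three objectives (Humor, Helpful, Harmless). A minimal sketch of that Monte Carlo estimate; the `reward_vector` function here is a hypothetical stand-in, not the paper's Hugging Face reward models:

```python
import random

# Hypothetical per-sample reward function: one score per objective
# (Humor, Helpful, Harmless). A real run would query trained reward models.
def reward_vector(sample_id):
    rng = random.Random(sample_id)  # deterministic placeholder rewards
    return [rng.uniform(0.0, 1.0) for _ in range(3)]

def expected_reward_vector(samples):
    """Estimate V_t as the per-objective average reward over the samples."""
    totals = [0.0, 0.0, 0.0]
    for s in samples:
        for i, r in enumerate(reward_vector(s)):
            totals[i] += r
    return [t / len(samples) for t in totals]

# Empirical expectation over 100 training samples, as in the reported setup.
V_t = expected_reward_vector(range(100))
print(len(V_t))  # one entry per objective
```

With 100 samples the estimate is coarse but cheap; the averaging itself is objective-wise, so adding an objective only appends one more reward column.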