Projection Optimization: A General Framework for Multi-Objective and Multi-Group RLHF

Authors: Nuoya Xiong, Aarti Singh

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, leveraging our theoretical insights, we propose a nearly training-free algorithm once the optimal policies for individual objectives are obtained. [...] 6. Experiments We fine-tune a LLAMA2-7B model using Anthropic-HH dataset (Bai et al., 2022) with three different objectives of an LM assistant: Humor, Helpful, and Harmless. [...] The experimental results show that MOPO performs generally better.
Researcher Affiliation | Academia | Carnegie Mellon University, PA, USA. Correspondence to: Aarti Singh <EMAIL>.
Pseudocode | Yes | Algorithm 1 MOP-Reward Based (RB) [...] Algorithm 2 MOP-Reward Free (RF) (Online Version) [...] Algorithm 3 MOPO-Offline [...] Algorithm 4 VPO-objective-learning-general [...] Algorithm 5 MOPO (Practical Version)-Offline
Open Source Code | No | The paper relies on several existing reward models available on Hugging Face (e.g., https://huggingface.co/Ray2333/gpt2-large-harmless-reward_model; a hedged loading sketch appears after this table). However, there is no explicit statement or link indicating that the authors have publicly released their own implementation of the proposed methodology.
Open Datasets | Yes | We fine-tune a LLAMA2-7B model using Anthropic-HH dataset (Bai et al., 2022)
Dataset Splits | No | The paper mentions using the "Anthropic-HH dataset" but does not provide specific details regarding its training, validation, or test splits. It does not mention percentages, counts, or a predefined split strategy.
Hardware Specification | No | The paper mentions fine-tuning a "LLAMA2-7B model" and running experiments but does not specify any hardware details such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions fine-tuning a "LLAMA2-7B model" and using "MOD (Shi et al., 2024)" and a "PPO approach" but does not provide version numbers for any software libraries, frameworks, or programming languages used in the implementation.
Experiment Setup | Yes | In our experiments, we set the number of iterations to 7, striking a balance between computational efficiency and performance. To compute the expected reward vector V^t, we take the expectation over 100 training samples. (A hedged estimation sketch follows the table.)
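
The Open Source Code row cites off-the-shelf reward models hosted on Hugging Face rather than author-released code. The snippet below is a minimal sketch, not the authors' implementation, of how one of those cited checkpoints could be loaded and used to score a single response; it assumes the checkpoint loads as a sequence-classification model with a single scalar logit, and the prompt/response text is invented for illustration.

```python
# Hedged sketch: scoring one response with an off-the-shelf harmlessness
# reward model cited in the paper. Assumes the checkpoint exposes a
# sequence-classification head with a single scalar logit; the prompt and
# response strings are invented for illustration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "Ray2333/gpt2-large-harmless-reward_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name)
reward_model.eval()

prompt = "Human: How can I apologize to a friend I hurt? Assistant:"
response = " Acknowledge what happened, say you are sorry, and ask how to make it right."

inputs = tokenizer(prompt + response, return_tensors="pt", truncation=True)
with torch.no_grad():
    reward = reward_model(**inputs).logits[0, 0].item()  # scalar reward score
print(f"harmlessness reward: {reward:.3f}")
```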
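
The Experiment Setup row reports that the expected reward vector V^t is estimated over 100 training samples at each of 7 iterations. The sketch below gives one plausible Monte Carlo reading of that estimation step; the helper names (generate, reward_fns) and the sampling details are assumptions for illustration, not the paper's code.

```python
# Hedged sketch: estimating an expected reward vector V^t by averaging each
# objective's reward over a batch of sampled training prompts. `generate`
# (the current policy) and `reward_fns` (one scorer per objective, e.g.
# humor / helpful / harmless) are hypothetical placeholders.
from typing import Callable, Sequence
import random

def estimate_reward_vector(
    prompts: Sequence[str],
    generate: Callable[[str], str],
    reward_fns: Sequence[Callable[[str, str], float]],
    num_samples: int = 100,
) -> list[float]:
    """Monte Carlo estimate of the expected reward for each objective."""
    batch = random.sample(list(prompts), k=min(num_samples, len(prompts)))
    totals = [0.0] * len(reward_fns)
    for prompt in batch:
        response = generate(prompt)  # sample a response from the current policy
        for j, reward_fn in enumerate(reward_fns):
            totals[j] += reward_fn(prompt, response)
    return [t / len(batch) for t in totals]  # one entry of V^t per objective
```

Under the reported setup, such an estimate would be recomputed at each of the 7 iterations; how it is then used to update the combined policy is specified by the MOPO algorithms listed in the Pseudocode row.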