MAP: Multi-Human-Value Alignment Palette

Authors: Xinran Wang, Qi Le, Ammar Ahmed, Enmao Diao, Yi Zhou, Nathalie Baracaldo, Jie Ding, Ali Anwar

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive experiments demonstrate MAP's ability to align multiple values in a principled manner while delivering strong empirical performance across various tasks." (Section 3, Experimental Study) |
| Researcher Affiliation | Collaboration | 1. University of Minnesota; 2. IBM Research. EMAIL, diao EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: MAP Procedure; Algorithm 2: Automatic Palette Adjustment via Interpolation; Algorithm 3: Automatic Palette Adjustment via Greedy Search |
| Open Source Code | Yes | "Our code is available at https://github.com/wang8740/MAP." |
| Open Datasets | Yes | "We generate prompts from two data sources: Anthropic harmless data (Bai et al., 2022), which includes human requests delineated between the tags Human: and Assistant:, and IMDB data (Maas et al., 2011), from which we retain movie reviews exceeding 30 characters in length." |
| Dataset Splits | No | The paper mentions using a "test split of our task data" for evaluation but does not give specific percentages or absolute counts for the training, validation, and test sets in the main text. It mentions creating a pilot dataset of n = 2000 for quantile estimation, but this is not the main train/validation/test split. |
| Hardware Specification | Yes | "Our experiments were conducted using a single Nvidia A100 GPU." |
| Software Dependencies | No | "For model finetuning, we utilized the TRL package (von Werra et al., 2020) for DPO and PPO training. Specifically, for DPO, we used an effective batch size of 20, achieved by setting the batch size to 1 with an accumulation step of 20, over the course of a single training epoch. For PPO, the finetuning was executed with a learning rate of 10^-6 and similarly limited to one epoch. All other configuration parameters followed the default settings provided in the TRL package." |
| Experiment Setup | Yes | "For data generation, we employed a top-k decoding approach with a fixed k = 50 and a limit of 50 new tokens per sequence. For DPO, we used an effective batch size of 20 (batch size 1 with an accumulation step of 20) over a single training epoch. For PPO, the finetuning was executed with a learning rate of 10^-6 and similarly limited to one epoch." |
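The generation setup reported above (top-k sampling with a fixed k = 50, capped at 50 new tokens per sequence) follows the standard top-k decoding recipe. A minimal framework-free sketch is below; the function names and the `step_logits_fn` callback are illustrative placeholders, not the authors' code, which in practice would wrap a Hugging Face model's per-step logits.

```python
import math
import random

def top_k_sample(logits, k=50, rng=random):
    """Sample one token id from the k highest-scoring logits.

    Keeps only the top-k entries, renormalizes them with a softmax,
    and draws an index proportionally to the resulting probabilities.
    """
    # Rank token ids by logit and keep the k best.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the surviving logits (subtract the max for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

def generate(step_logits_fn, prompt_ids, max_new_tokens=50, k=50):
    """Autoregressively extend prompt_ids, one top-k sample per step."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        ids.append(top_k_sample(step_logits_fn(ids), k=k))
    return ids
```

With a real model, the same effect is typically obtained by passing `top_k=50` and `max_new_tokens=50` to a sampling-enabled `generate` call.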
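The DPO batch-size arithmetic quoted above (a per-step batch of 1 accumulated over 20 steps, for an effective batch of 20) is the standard gradient-accumulation pattern: scale each micro-batch gradient by the accumulation count and apply one optimizer step per window. A toy sketch under those assumptions, with a scalar gradient standing in for a model's parameters:

```python
def accumulate_and_step(grads, accumulation_steps=20, lr=1e-6):
    """Average per-example gradients over an accumulation window, then
    apply a single SGD-style update, mimicking batch_size=1 combined
    with gradient_accumulation_steps=20 (effective batch size 20).

    grads: one scalar gradient per micro-batch of size 1.
    Returns the list of parameter updates, one per optimizer step.
    """
    assert len(grads) % accumulation_steps == 0
    updates, running = [], 0.0
    for i, g in enumerate(grads, start=1):
        running += g / accumulation_steps   # scale each micro-batch gradient
        if i % accumulation_steps == 0:     # one step per full window
            updates.append(-lr * running)
            running = 0.0
    return updates
```

In TRL this corresponds to setting a per-device batch size of 1 with 20 gradient-accumulation steps in the trainer's configuration, which is how the paper reaches its effective batch size of 20 without holding 20 sequences in memory at once.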