MAP: Multi-Human-Value Alignment Palette

Authors: Xinran Wang, Qi Le, Ammar Ahmed, Enmao Diao, Yi Zhou, Nathalie Baracaldo, Jie Ding, Ali Anwar

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive experiments demonstrate MAP's ability to align multiple values in a principled manner while delivering strong empirical performance across various tasks." (Section 3, Experimental Study) |
| Researcher Affiliation | Collaboration | 1. University of Minnesota; 2. IBM Research. EMAIL, diao EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: MAP Procedure; Algorithm 2: Automatic Palette Adjustment via Interpolation; Algorithm 3: Automatic Palette Adjustment via Greedy Search |
| Open Source Code | Yes | "Our code is available at https://github.com/wang8740/MAP." |
| Open Datasets | Yes | "We generate prompts from two data sources: Anthropic harmless data (Bai et al., 2022), which includes human requests delineated between the tags Human: and Assistant:, and IMDB data (Maas et al., 2011), from which we retain movie reviews exceeding 30 characters in length." |
| Dataset Splits | No | The paper mentions using a "test split of our task data" for evaluation but does not give specific percentages or absolute counts for the training, validation, and test sets in the main text. It mentions creating a pilot dataset of n = 2000 for quantile estimation, but this is not the main train/validation/test split. |
| Hardware Specification | Yes | "Our experiments were conducted using a single Nvidia A100 GPU." |
| Software Dependencies | No | "For model finetuning, we utilized the TRL package (von Werra et al., 2020) for DPO and PPO training. Specifically, for DPO, we used an effective batch size of 20, achieved by setting the batch size to 1 with an accumulation step of 20, over the course of a single training epoch. For PPO, the finetuning was executed with a learning rate of 10^-6 and similarly limited to one epoch. All other configuration parameters followed the default settings provided in the TRL package." |
| Experiment Setup | Yes | "For data generation, we employed a top-k decoding approach with a fixed k = 50 and a limit of 50 new tokens per sequence. For DPO, we used an effective batch size of 20 (batch size 1 with an accumulation step of 20) over a single training epoch. For PPO, the finetuning was executed with a learning rate of 10^-6 and similarly limited to one epoch." |
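The generation setup reported above (top-k sampling with a fixed k = 50, capped at 50 new tokens per sequence) follows the standard top-k decoding recipe. A minimal framework-free sketch is below; the function names and the `step_logits_fn` callback are illustrative placeholders, not the authors' code, which in practice would wrap a Hugging Face model's per-step logits.

```python
import math
import random

def top_k_sample(logits, k=50, rng=random):
    """Sample one token id from the k highest-scoring logits.

    Keeps only the top-k entries, renormalizes them with a softmax,
    and draws an index proportionally to the resulting probabilities.
    """
    # Rank token ids by logit and keep the k best.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the surviving logits (subtract the max for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

def generate(step_logits_fn, prompt_ids, max_new_tokens=50, k=50):
    """Autoregressively extend prompt_ids, one top-k sample per step."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        ids.append(top_k_sample(step_logits_fn(ids), k=k))
    return ids
```

With a real model, the same effect is typically obtained by passing `top_k=50` and `max_new_tokens=50` to a sampling-enabled `generate` call.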
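The DPO batch-size arithmetic quoted above (a per-step batch of 1 accumulated over 20 steps, for an effective batch of 20) is the standard gradient-accumulation pattern: scale each micro-batch gradient by the accumulation count and apply one optimizer step per window. A toy sketch under those assumptions, with a scalar gradient standing in for a model's parameters:

```python
def accumulate_and_step(grads, accumulation_steps=20, lr=1e-6):
    """Average per-example gradients over an accumulation window, then
    apply a single SGD-style update, mimicking batch_size=1 combined
    with gradient_accumulation_steps=20 (effective batch size 20).

    grads: one scalar gradient per micro-batch of size 1.
    Returns the list of parameter updates, one per optimizer step.
    """
    assert len(grads) % accumulation_steps == 0
    updates, running = [], 0.0
    for i, g in enumerate(grads, start=1):
        running += g / accumulation_steps   # scale each micro-batch gradient
        if i % accumulation_steps == 0:     # one step per full window
            updates.append(-lr * running)
            running = 0.0
    return updates
```

In TRL this corresponds to setting a per-device batch size of 1 with 20 gradient-accumulation steps in the trainer's configuration, which is how the paper reaches its effective batch size of 20 without holding 20 sequences in memory at once.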