MAP: Multi-Human-Value Alignment Palette
Authors: Xinran Wang, Qi Le, Ammar Ahmed, Enmao Diao, Yi Zhou, Nathalie Baracaldo, Jie Ding, Ali Anwar
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate MAP's ability to align multiple values in a principled manner while delivering strong empirical performance across various tasks. (Section 3, Experimental Study) |
| Researcher Affiliation | Collaboration | 1 University of Minnesota, 2 IBM Research |
| Pseudocode | Yes | Algorithm 1: MAP Procedure; Algorithm 2: Automatic Palette Adjustment via Interpolation; Algorithm 3: Automatic Palette Adjustment via Greedy Search |
| Open Source Code | Yes | Our code is available at https://github.com/wang8740/MAP. |
| Open Datasets | Yes | We generate prompts from two data sources: Anthropic harmless data (Bai et al., 2022), which includes human requests delineated between the tags Human: and Assistant: , and IMDB data (Maas et al., 2011) from which we retain movie reviews exceeding 30 characters in length. |
| Dataset Splits | No | The paper mentions using a "test split of our task data" for evaluation but does not provide specific percentages or absolute counts for training, validation, and test datasets in the main text. It mentions creating a pilot dataset of n=2000 for quantile estimation, but this is not the main train/test/val split. |
| Hardware Specification | Yes | Our experiments were conducted using a single Nvidia A100 GPU. |
| Software Dependencies | No | In terms of model finetuning, we utilized the TRL package (von Werra et al., 2020) for DPO and PPO training. Specifically, for DPO, we used an effective batch size of 20, achieved by setting the batch size to 1 with an accumulation step of 20, over the course of a single training epoch. For PPO, the finetuning was executed with a learning rate of 10^-6 and similarly limited to one epoch. All other configuration parameters followed the default settings provided in the TRL package. |
| Experiment Setup | Yes | For data generation, we employed a top-k decoding approach with a fixed k = 50 and a limit of 50 new tokens per sequence. For DPO, we used an effective batch size of 20, achieved by setting the batch size to 1 with an accumulation step of 20, over the course of a single training epoch. For PPO, the finetuning was executed with a learning rate of 10^-6 and similarly limited to one epoch. |
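The reported hyperparameters can be collected into a minimal sketch. This is illustrative only, not the authors' actual code: the dict keys mirror Hugging Face `transformers`/TRL argument names (`per_device_train_batch_size`, `gradient_accumulation_steps`, `top_k`, `max_new_tokens`), and the helper `effective_batch_size` is a hypothetical function showing how the effective batch size of 20 arises from accumulation.

```python
# Hyperparameters quoted in the table, as plain config dicts.
# Key names follow transformers/TRL conventions; this is a sketch,
# not the MAP repository's configuration files.

# Data generation: top-k sampling with k = 50, 50 new tokens per sequence.
generation_config = {
    "do_sample": True,
    "top_k": 50,
    "max_new_tokens": 50,
}

# DPO: effective batch size 20 = batch size 1 x 20 accumulation steps,
# trained for a single epoch.
dpo_config = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 20,
    "num_train_epochs": 1,
}

# PPO: learning rate 10^-6, also limited to one epoch.
ppo_config = {
    "learning_rate": 1e-6,
    "num_train_epochs": 1,
}

def effective_batch_size(cfg):
    """Effective batch size = per-device batch * gradient accumulation steps."""
    return cfg["per_device_train_batch_size"] * cfg["gradient_accumulation_steps"]
```

Under these settings, `effective_batch_size(dpo_config)` recovers the effective batch size of 20 stated in the paper; all remaining options would fall back to TRL defaults.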