No Preference Left Behind: Group Distributional Preference Optimization

Authors: Binwei Yao, Zefan Cai, Yun-Shiuan Chuang, Shanglin Yang, Ming Jiang, Diyi Yang, Junjie Hu

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In experiments using both synthetic controllable opinion generation and real-world movie review datasets, we show that DPO fails to align with the targeted belief distributions, while GDPO consistently reduces this alignment gap during training. Moreover, our evaluation metrics demonstrate that GDPO outperforms existing approaches in aligning with group distributional preferences, marking a significant advance in pluralistic alignment.
Researcher Affiliation Academia 1University of Wisconsin-Madison, 2Indiana University Indianapolis, 3Stanford University
Pseudocode No The paper describes steps in regular paragraph text and mathematical formulations (e.g., Eq. (4), (6), (7)) but does not include any structured pseudocode or algorithm blocks.
Open Source Code Yes Our data and code are released at https://github.com/BigBinnie/GDPO.
Open Datasets Yes We construct the synthetic dataset to simulate one-turn dialogues that reflect diverse opinions from various countries, using Global Opinion QA (Durmus et al., 2023), a multi-choice question-answer dataset focused on global issues. For the controllable review generation task, we use movie reviews written by users from the Amazon Movie Review dataset (https://snap.stanford.edu/data/web-Amazon.html) to build the preference dataset.
Dataset Splits Yes Table 1: Dataset Statistics of Controllable Opinion Generation. The number following the country name is the total number of questions in Global Opinion QA used to generate dialogues; each country has a small and a large variant.

Split | United States (469) small / large | Pakistan (219) small / large | S.Africa (162) small / large
Train | 14,321 / 176,905 | 6,684 / 80,364 | 4,960 / 54,896
Eval  | 1,843 / 22,166   | 860 / 10,070   | 636 / 6,878
Test  | 1,843 / 22,199   | 876 / 10,086   | 648 / 6,890
Hardware Specification Yes We train GPT-2 Large with a total batch size of 128 and 40 epochs, distributed over 4 A5000 GPUs in SFT. Gradients are accumulated over 2 steps. Then, we train GPT-2 Large with a total batch size of 32 with 20 epochs, distributed across 4 A5000 GPUs in DPO and GDPO...We train Pythia-2.8B with a total batch size of 128 and 40 epochs, distributed over 4 A40 GPUs in SFT.
Software Dependencies No The paper mentions software components like 'RMSprop' and 'Adam' (optimizers) and 'GPT-3.5-turbo' (for data generation), but it does not specify any version numbers for these or other software libraries or programming languages used in the implementation or experimentation.
Experiment Setup Yes We train GPT-2 Large with a total batch size of 128 and 40 epochs, distributed over 4 A5000 GPUs in SFT. Gradients are accumulated over 2 steps. Then, we train GPT-2 Large with a total batch size of 32 with 20 epochs, distributed across 4 A5000 GPUs in DPO and GDPO, and gradients are accumulated over 8 steps to effectively reduce memory requirements and ensure fair comparison...We set β of DPO and GDPO to 0.1. The data type is set to bfloat16. The optimizer used is RMSprop...The learning rate is initialized to 5e-7 with a linear warmup for the first 150 steps. For every 10000 steps, we evaluate the model on the validation set.
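The hyperparameters quoted above can be collected into a single configuration object. This is a minimal illustrative sketch, not the paper's released code: the `TrainConfig` class and the per-device batch-size decomposition (total batch size split across 4 GPUs with gradient accumulation) are assumptions for illustration; only the numeric values come from the quoted setup.

```python
# Sketch of the DPO/GDPO training configuration reported in the paper.
# `TrainConfig` and the per-device decomposition are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class TrainConfig:
    model_name: str = "gpt2-large"   # Pythia-2.8B is trained analogously
    total_batch_size: int = 32       # SFT stage uses 128
    epochs: int = 20                 # SFT stage uses 40
    num_gpus: int = 4                # 4x A5000 (A40 for Pythia-2.8B)
    grad_accum_steps: int = 8        # SFT stage uses 2
    beta: float = 0.1                # DPO/GDPO beta
    dtype: str = "bfloat16"
    optimizer: str = "RMSprop"
    learning_rate: float = 5e-7
    warmup_steps: int = 150          # linear warmup
    eval_every_steps: int = 10_000   # validation-set evaluation interval

    def per_device_batch_size(self) -> int:
        # Effective batch = per_device * num_gpus * grad_accum_steps,
        # so the per-device micro-batch follows by division.
        return self.total_batch_size // (self.num_gpus * self.grad_accum_steps)


cfg = TrainConfig()
print(cfg.per_device_batch_size())  # 32 / (4 * 8) = 1
```

Under these numbers, the heavy gradient accumulation (8 steps) leaves a micro-batch of one example per GPU, consistent with the stated goal of reducing memory requirements while keeping the effective batch size comparable across methods.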