PAL: Sample-Efficient Personalized Reward Modeling for Pluralistic Alignment
Authors: Daiwei Chen, Yi Chen, Aniket Rege, Zhi Wang, Ramya Vinayak
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive empirical evaluation demonstrates that PAL matches or outperforms state-of-the-art methods on both text-to-text and text-to-image tasks: on Reddit TL;DR Summary, PAL is 1.7% more accurate for seen users and 36% more accurate for unseen users compared to the previous best method, with 100× fewer learned parameters. On Pick-a-Pic v2, PAL is 2.5% more accurate than the best method with 156× fewer learned parameters. Finally, we provide theoretical analysis for generalization of rewards learned via PAL showcasing the reduction in number of samples needed per user. |
| Researcher Affiliation | Academia | Daiwei Chen, Yi Chen, Aniket Rege, Zhi Wang, and Ramya Korlakai Vinayak; University of Wisconsin–Madison, Madison, WI, USA |
| Pseudocode | Yes | See Appendix B for pseudocode details. ... Algorithm 1 Learning algorithm for PAL-A ... Algorithm 2 Learning algorithm for PAL-B |
| Open Source Code | Yes | Our code is publicly available at https://github.com/RamyaLab/pluralistic-alignment. |
| Open Datasets | Yes | Reddit TL;DR Summary dataset curated by Stiennon et al. (2020) contains a series of preferences over summaries generated by language models. ... The Pick-a-Pic dataset (Kirstain et al., 2024) is a large, open dataset for human feedback in T2I generation. |
| Dataset Splits | Yes | This processed dataset contains 20,969 training samples, 2,330 validation samples, and 4,921 test samples. ... Given the varying numbers of few-shot samples in the training set, we partitioned each user's comparison pairs into training, validation, and test sets, resulting in multiple datasets. ... Table D.6: Number of samples in each split of the newly constructed Pick-a-Filter dataset. |
| Hardware Specification | Yes | PAL-B-Tiny (∼6M params) exceeds SoTA performance while training on a single RTX 4090 GPU (see Appendix E)... We conducted most of our experiments using 4× RTX 4090 GPUs, each with 24 GB of VRAM. For the experiments involving a foundation model that has 1.3B parameters or more, we used 2× A100 GPUs, each with 80 GB of VRAM. |
| Software Dependencies | No | The paper mentions software components like "Adam optimizer", "OPT-350M", "DistilBERT", "BGE-M3", "gte-Qwen2-1.5B", "CLIP-H/14", "PyTorch", but does not specify their version numbers. |
| Experiment Setup | Yes | Details of the loss function, hyperparameter setting, unseen dataset, and training setup are deferred to Appendix D.3. ... Table D.2: The training hyperparameter setting of PAL reward modeling on Reddit TL;DR: K = 2; Batch size = 4; Projectors = mlp-2layer-gelu-dropout0; Epochs = 1; Learning rate of LLM = 9.65e-6; Learning rate of projectors = 1e-4; Learning rate of user weights = 5e-3; Weight decay of LLM = 0.0; Weight decay of projectors = 0.01; Weight decay of user weights = 0.0; Loss weighting = cumulative; Dimension of preference embedding = 512; End-of-conversation token = `<\|endoftext\|>`; Maximum sequence length = 600 |
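The table above references K = 2 prototype reward functions and per-user mixture weights trained with a pairwise preference loss. As a rough, hedged illustration of that setup (not the authors' implementation; all names, shapes, and the linear reward form are assumptions made here for brevity), a minimal NumPy sketch of a Bradley–Terry loss over a convex mixture of K prototype rewards might look like:

```python
import numpy as np

def pal_pairwise_loss(z_win, z_lose, prototypes, user_logits):
    """Illustrative Bradley-Terry loss where a user's reward is a convex
    combination of K prototype reward functions (linear rewards assumed).

    z_win, z_lose : (d,)   embeddings of the preferred / rejected items
    prototypes    : (K, d) prototype reward weight vectors
    user_logits   : (K,)   per-user logits; softmax gives mixture weights
    """
    # softmax over logits -> convex mixture weights over the K prototypes
    w = np.exp(user_logits - user_logits.max())
    w = w / w.sum()
    r_win = w @ (prototypes @ z_win)    # user-specific reward of winner
    r_lose = w @ (prototypes @ z_lose)  # user-specific reward of loser
    # negative log-likelihood that the preferred item wins
    return -np.log(1.0 / (1.0 + np.exp(-(r_win - r_lose))))

rng = np.random.default_rng(0)
K, d = 2, 8  # K = 2 matches Table D.2 above; d is arbitrary here
protos = rng.normal(size=(K, d))
loss = pal_pairwise_loss(rng.normal(size=d), rng.normal(size=d),
                         protos, np.zeros(K))  # uniform user weights
```

In training, the prototype vectors and per-user logits would both be learned by gradient descent on this loss; the paper's actual algorithms (PAL-A, PAL-B) are given in its Appendix B.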