PAL: Sample-Efficient Personalized Reward Modeling for Pluralistic Alignment
Authors: Daiwei Chen, Yi Chen, Aniket Rege, Zhi Wang, Ramya Vinayak
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive empirical evaluation demonstrates that PAL matches or outperforms state-of-the-art methods on both text-to-text and text-to-image tasks: on Reddit TL;DR Summary, PAL is 1.7% more accurate for seen users and 36% more accurate for unseen users compared to the previous best method, with 100× fewer learned parameters. On Pick-a-Pic v2, PAL is 2.5% more accurate than the best method with 156× fewer learned parameters. Finally, we provide theoretical analysis for generalization of rewards learned via PAL showcasing the reduction in number of samples needed per user. |
| Researcher Affiliation | Academia | Daiwei Chen, Yi Chen, Aniket Rege, Zhi Wang, and Ramya Korlakai Vinayak; University of Wisconsin–Madison, Madison, WI, USA |
| Pseudocode | Yes | See Appendix B for pseudocode details. ... Algorithm 1 Learning algorithm for PAL-A ... Algorithm 2 Learning algorithm for PAL-B |
| Open Source Code | Yes | Our code is publicly available at https://github.com/RamyaLab/pluralistic-alignment. |
| Open Datasets | Yes | Reddit TL;DR Summary dataset curated by Stiennon et al. (2020) contains a series of preferences over summaries generated by language models. ... The Pick-a-Pic dataset (Kirstain et al., 2024) is a large, open dataset for human feedback in T2I generation. |
| Dataset Splits | Yes | This processed dataset contains 20,969 training samples, 2,330 validation samples, and 4,921 test samples. ... Given the varying numbers of few-shot samples in the training set, we partitioned each user's comparison pairs into training, validation, and test sets, resulting in multiple datasets. ... Table D.6: Number of samples in each split of the newly constructed Pick-a-Filter dataset. |
| Hardware Specification | Yes | PAL-B-Tiny (∼6M params) exceeds SoTA performance while training on a single RTX 4090 GPU (see Appendix E)... We conducted most of our experiments using 4× RTX 4090 GPUs, each with 24 GB of VRAM. For the experiments involving a foundation model that has 1.3B parameters or more, we used 2× A100 GPUs, each with 80 GB of VRAM. |
| Software Dependencies | No | The paper mentions software components like "Adam optimizer", "OPT-350M", "DistilBERT", "BGE-M3", "gte-Qwen2-1.5B", "CLIP-H/14", "PyTorch", but does not specify their version numbers. |
| Experiment Setup | Yes | Details of the loss function, hyperparameter setting, unseen dataset, and training setup are deferred to Appendix D.3. ... Table D.2: The training hyperparameter setting of PAL reward modeling on Reddit TL;DR: K = 2; Batch size = 4; Projectors = mlp-2layer-gelu-dropout0; Epochs = 1; Learning rate of LLM = 9.65e-6; Learning rate of projectors = 1e-4; Learning rate of user weights = 5e-3; Weight decay of LLM = 0.0; Weight decay of projectors = 0.01; Weight decay of user weights = 0.0; Loss weighting = cumulative; Dimension of preference embedding = 512; End-of-conversation token = `<\|endoftext\|>`; Maximum sequence length = 600 |
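The table above references K = 2 prototype reward functions and per-user mixture weights trained with a pairwise preference loss. As a rough, hedged illustration of that setup (not the authors' implementation; all names, shapes, and the linear reward form are assumptions made here for brevity), a minimal NumPy sketch of a Bradley–Terry loss over a convex mixture of K prototype rewards might look like:

```python
import numpy as np

def pal_pairwise_loss(z_win, z_lose, prototypes, user_logits):
    """Illustrative Bradley-Terry loss where a user's reward is a convex
    combination of K prototype reward functions (linear rewards assumed).

    z_win, z_lose : (d,)   embeddings of the preferred / rejected items
    prototypes    : (K, d) prototype reward weight vectors
    user_logits   : (K,)   per-user logits; softmax gives mixture weights
    """
    # softmax over logits -> convex mixture weights over the K prototypes
    w = np.exp(user_logits - user_logits.max())
    w = w / w.sum()
    r_win = w @ (prototypes @ z_win)    # user-specific reward of winner
    r_lose = w @ (prototypes @ z_lose)  # user-specific reward of loser
    # negative log-likelihood that the preferred item wins
    return -np.log(1.0 / (1.0 + np.exp(-(r_win - r_lose))))

rng = np.random.default_rng(0)
K, d = 2, 8  # K = 2 matches Table D.2 above; d is arbitrary here
protos = rng.normal(size=(K, d))
loss = pal_pairwise_loss(rng.normal(size=d), rng.normal(size=d),
                         protos, np.zeros(K))  # uniform user weights
```

In training, the prototype vectors and per-user logits would both be learned by gradient descent on this loss; the paper's actual algorithms (PAL-A, PAL-B) are given in its Appendix B.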