Diverse Preference Learning for Capabilities and Alignment
Authors: Stewart Slocum, Asher Parker-Sartori, Dylan Hadfield-Menell
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we consider four experimental settings for evaluating our algorithm. First, we show that in general-purpose chat domains, SPL allows for increased diversity with less performance degradation than DPO with token-level temperature scaling. Second, we consider an application of high-temperature generation in best-of-N problem-solving settings. Finally, we evaluate SPL's logit calibration, finding reduced overconfidence and improved calibration on standard multiple-choice benchmarks. Figure 2: Improved diversity-quality tradeoffs with SPL. |
| Researcher Affiliation | Academia | Stewart Slocum, Asher Parker-Sartori, and Dylan Hadfield-Menell MIT CSAIL EMAIL |
| Pseudocode | No | The paper describes its methodology through mathematical formulations and theoretical analysis in Section 3 and Appendix A, but does not include any explicitly labeled pseudocode or algorithm blocks with structured, step-by-step procedures. |
| Open Source Code | No | The paper does not explicitly state that source code is provided, nor does it include any links to a code repository. |
| Open Datasets | Yes | We train on the HH-RLHF preference dataset for 5,000 steps (details in Appendix C.4) (Bai et al., 2022). For these experiments, we LoRA finetune Mistral-7B-Instruct-v0.2 with DPO and SPL (Rafailov et al., 2024; Hu et al., 2021). We apply DPO and SPL to a Mistral-7B base model (Hugging Face, 2023) trained with supervised fine-tuning on the UltraChat dataset (Ding et al., 2023). This approach trains on the UltraFeedback-200k dataset, a large preference dataset covering a broad suite of chat and reasoning tasks (Cui et al., 2024). We evaluate against two mathematical reasoning datasets: the GSM8K grade-school math dataset (Cobbe et al., 2021) and the more challenging MATH dataset (Hendrycks et al., 2021b). We evaluate against two standard multiple-choice datasets: TruthfulQA and MMLU. TruthfulQA is a benchmark designed to assess a model's ability to provide truthful answers in contexts where misconceptions are prevalent (Lin et al., 2022). MMLU tests a model's knowledge and reasoning across 57 diverse subjects, from philosophy to abstract algebra (Hendrycks et al., 2021a). |
| Dataset Splits | Yes | At inference time, we sample 500 prompts from a held-out validation split of HH-RLHF that neither the language models nor the reward model were trained on. We sample 128 completions on a random split of 200 problems from each dataset. We also divide problems into Easy, Medium, and Hard categories. For MATH, this corresponds to Level 1, Level 3, and Level 5 problems. For GSM8K, we run our evaluation on Mistral-Instruct-7B and group problems as easy if they take 4 or fewer samples to solve, medium if they take 5-64 samples to solve, and hard if they take more than 64 samples to solve. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running its experiments. |
| Software Dependencies | No | The paper mentions several models and tools like Mistral-7B-Instruct-v0.2, Sentence-BERT-Large, OpenAI's text-embedding-3-small, and gpt-4o-mini-2024-07-18, but does not provide specific versions for ancillary software libraries or frameworks (e.g., Python, PyTorch, Transformers). |
| Experiment Setup | Yes | For both DPO and SPL, we LoRA finetune Mistral-7B-Instruct-v0.2 on HH-RLHF for 5,000 steps with batch size 8. We use LoRA rank r_LoRA = 16, regularization α_LoRA = 16, and dropout p_LoRA = 0.05. We use learning rate 1e-5, 150 warmup steps, and max conversation length of 512 tokens. For all runs, we use regularization parameter β = 0.1. For both DPO and SPL, our base model is a Mistral-7B base model that has been full-parameter supervised fine-tuned on the UltraChat dataset. We then LoRA-finetune this model on UltraFeedback-200k for one epoch. We use batch size 4, LoRA rank r_LoRA = 64, regularization α_LoRA = 64, and dropout p_LoRA = 0.05. We use learning rate 1e-5, 150 warmup steps, and max conversation length of 1024 tokens. |
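The two training configurations reported in the Experiment Setup row can be collected into a small, self-contained sketch. The paper's code is not released, so the `LoRADPORunConfig` dataclass and its field names are illustrative assumptions, not the authors' implementation; only the hyperparameter values are taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoRADPORunConfig:
    """Hypothetical container for the hyperparameters reported in the paper."""
    dataset: str
    duration: str          # training length as stated (steps or epochs)
    batch_size: int
    lora_rank: int         # r_LoRA
    lora_alpha: int        # α_LoRA regularization
    lora_dropout: float    # p_LoRA
    learning_rate: float
    warmup_steps: int
    max_length: int        # max conversation length in tokens
    beta: float            # DPO/SPL regularization parameter β


# HH-RLHF chat experiments (LoRA finetune of Mistral-7B-Instruct-v0.2)
hh_rlhf_run = LoRADPORunConfig(
    dataset="HH-RLHF", duration="5000 steps", batch_size=8,
    lora_rank=16, lora_alpha=16, lora_dropout=0.05,
    learning_rate=1e-5, warmup_steps=150, max_length=512, beta=0.1,
)

# UltraFeedback-200k experiments (base model: Mistral-7B SFT'd on UltraChat);
# β = 0.1 is applied here on the paper's statement that it holds "for all runs"
ultrafeedback_run = LoRADPORunConfig(
    dataset="UltraFeedback-200k", duration="1 epoch", batch_size=4,
    lora_rank=64, lora_alpha=64, lora_dropout=0.05,
    learning_rate=1e-5, warmup_steps=150, max_length=1024, beta=0.1,
)
```

In an actual run these values would map onto a LoRA adapter config (rank, alpha, dropout) and a DPO trainer config (β, learning rate, warmup, batch size) in whatever training framework is used.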
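The easy/medium/hard grouping rule from the Dataset Splits row (for GSM8K: 4 or fewer best-of-N samples to solve is easy, 5-64 is medium, more than 64 is hard) is simple enough to state as a one-line bucketing function. `difficulty_bucket` is a hypothetical helper name, not code from the paper.

```python
def difficulty_bucket(samples_to_solve: int) -> str:
    """Bucket a GSM8K problem by how many best-of-N samples it took to solve.

    Thresholds follow the paper's Dataset Splits description:
    easy <= 4 samples, medium 5-64 samples, hard > 64 samples.
    """
    if samples_to_solve <= 4:
        return "easy"
    if samples_to_solve <= 64:
        return "medium"
    return "hard"
```

For MATH, the paper instead uses the dataset's built-in Level 1 / Level 3 / Level 5 labels, so no such function is needed there.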