Aligning LLMs by Predicting Preferences from User Writing Samples
Authors: Stéphane Aroca-Ouellette, Natalie Mackraz, Barry-John Theobald, Katherine Metcalf
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate PROSE with several LLMs (i.e., Qwen2.5 7B and 72B Instruct, GPT-4o-mini, and GPT-4o) on a summarization and an email writing task. We find that PROSE more accurately infers nuanced human preferences, improving the quality of the writing agent's generations over CIPHER (a state-of-the-art method for inferring preferences) by 33%. Lastly, we demonstrate that ICL and PROSE are complementary methods, and combining them provides up to a 9% improvement over ICL alone. |
| Researcher Affiliation | Collaboration | Stéphane Aroca-Ouellette¹ ², Natalie Mackraz³, Barry-John Theobald³, Katherine Metcalf³. ¹Work done during internship. ²Department of Computer Science, University of Colorado, Boulder, CO, USA. ³Apple, Cupertino, CA, USA. |
| Pseudocode | Yes | The algorithm is provided in Appendix A, and the complete prompts are in Figure 8 (Appendix F.1). ... Algorithm 1 Assistant Task Completion ... Algorithm 2 PROSE: Preference Reasoning by Observing and Synthesizing Examples |
| Open Source Code | Yes | Code: https://github.com/apple/ml-predict |
| Open Datasets | Yes | We evaluate PROSE on PRELUDE (Gao et al., 2024), the assistive writing benchmark accompanying CIPHER, and identify several limitations. |
| Dataset Splits | Yes | The agent aligns itself with four (email) or five (summarization) users with five demonstrations per user. Performance is evaluated per task as the mean across all demonstrations, users, and task types. Each task is run over five seeds (standard error is reported over the seeds). |
| Hardware Specification | No | The paper mentions LLMs (Qwen2.5-7B-Instruct, Qwen2.5-72B-Instruct, GPT-4o-mini, and GPT-4o) used in the experiments and GPT-4o as a synthetic human, but no specific hardware (GPU, CPU, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using specific LLM models (Qwen2.5-7B-Instruct, Qwen2.5-72B-Instruct, GPT-4o-mini, and GPT-4o) but does not provide any specific software dependencies like programming languages, libraries, or frameworks with their version numbers. |
| Experiment Setup | Yes | For all LLMs, S and v are determined via a hyperparameter sweep over v ∈ {0, 0.25, 0.5, 0.75, 1} and S ∈ {2, 3, 4, 5}. In our experiments S = 5 for all LLMs, and v = 0.25 for Qwen2.5-7B-Instruct, v = 0.5 for both GPT-4o models, and v = 0.75 for Qwen2.5-72B-Instruct. |
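The evaluation protocol above reports the mean over five seeds with the standard error across seeds. A minimal sketch of that aggregation (the function name and input shape are assumptions; the paper does not show this code):

```python
import statistics

def mean_and_stderr(seed_scores):
    """Aggregate per-seed scores: mean and standard error over seeds.

    `seed_scores` is one score per random seed (e.g., five values for
    the five seeds described in the evaluation protocol).
    """
    mean = statistics.mean(seed_scores)
    # Standard error = sample standard deviation / sqrt(number of seeds).
    stderr = statistics.stdev(seed_scores) / len(seed_scores) ** 0.5
    return mean, stderr
```

For example, `mean_and_stderr([1, 2, 3, 4, 5])` returns a mean of 3.0 with a standard error of about 0.707.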
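The hyperparameter selection described in the Experiment Setup row is an exhaustive grid sweep over S and v. A hedged sketch of such a sweep, assuming a caller-supplied scoring function (the sweep logic and `evaluate` interface are assumptions, not the paper's code; the grids and selected values are taken from the quote above):

```python
from itertools import product

# Candidate grids, per the paper's reported sweep.
S_GRID = [2, 3, 4, 5]
V_GRID = [0.0, 0.25, 0.5, 0.75, 1.0]

def grid_sweep(evaluate):
    """Return the (S, v) pair maximizing evaluate(S, v) over the grids."""
    return max(product(S_GRID, V_GRID), key=lambda sv: evaluate(*sv))

# Values the paper reports selecting for each LLM: (S, v).
SELECTED = {
    "Qwen2.5-7B-Instruct": (5, 0.25),
    "Qwen2.5-72B-Instruct": (5, 0.75),
    "GPT-4o-mini": (5, 0.5),
    "GPT-4o": (5, 0.5),
}
```

With a toy objective such as `lambda S, v: S + v`, `grid_sweep` returns `(5, 1.0)`, the corner of the grid.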