Aligning LLMs by Predicting Preferences from User Writing Samples

Authors: Stéphane Aroca-Ouellette, Natalie Mackraz, Barry-John Theobald, Katherine Metcalf

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | "We evaluate PROSE with several LLMs (i.e., Qwen2.5 7B and 72B Instruct, GPT-4o-mini, and GPT-4o) on a summarization and an email writing task. We find that PROSE more accurately infers nuanced human preferences, improving the quality of the writing agent's generations over CIPHER (a state-of-the-art method for inferring preferences) by 33%. Lastly, we demonstrate that ICL and PROSE are complementary methods, and combining them provides up to a 9% improvement over ICL alone."
Researcher Affiliation | Collaboration | "Stéphane Aroca-Ouellette 1,2, Natalie Mackraz 3, Barry-John Theobald 3, Katherine Metcalf 3. 1 Work done during internship. 2 Department of Computer Science, University of Colorado, Boulder, CO, USA. 3 Apple, Cupertino, CA, USA."
Pseudocode | Yes | "The algorithm is provided in Appendix A, and the complete prompts are in Figure 8 (Appendix F.1). ... Algorithm 1 Assistant Task Completion ... Algorithm 2 PROSE: Preference Reasoning by Observing and Synthesizing Examples"
Open Source Code | Yes | "Code: https://github.com/apple/ml-predict"
Open Datasets | Yes | "We evaluate PROSE on PRELUDE (Gao et al., 2024), the assistive writing benchmark accompanying CIPHER, and identify several limitations."
Dataset Splits | Yes | "The agent aligns itself with four (email) or five (summarization) users with five demonstrations per user. Performance is evaluated per task as the mean across all demonstrations, users, and task types. Each task is run over five seeds (standard error is reported over the seeds)."
Hardware Specification | No | The paper names the LLMs used in the experiments (Qwen2.5-7B-Instruct, Qwen2.5-72B-Instruct, GPT-4o-mini, and GPT-4o) and GPT-4o as a synthetic human, but not the specific hardware (GPU, CPU, memory) used to run the experiments.
Software Dependencies | No | The paper mentions the specific LLM models used (Qwen2.5-7B-Instruct, Qwen2.5-72B-Instruct, GPT-4o-mini, and GPT-4o) but does not list software dependencies such as programming languages, libraries, or frameworks with version numbers.
Experiment Setup | Yes | "For all LLMs, S and v are determined via a hyperparameter sweep over v ∈ {0, 0.25, 0.5, 0.75, 1} and S ∈ {2, 3, 4, 5}. In our experiments S = 5 for all LLMs, and v = 0.25 for Qwen2.5-7B-Instruct, v = 0.5 for both GPT-4o models, and v = 0.75 for Qwen2.5-72B-Instruct."
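The grid sweep described in the experiment-setup quote can be sketched as a minimal Python snippet. Note the assumptions: `evaluate` is a hypothetical stand-in for running PROSE with a given (S, v) pair and returning a mean performance score; it is not part of the paper's released code, and the toy scorer below exists only to make the example runnable.

```python
from itertools import product

# Grids taken from the paper's reported sweep:
# v ∈ {0, 0.25, 0.5, 0.75, 1}, S ∈ {2, 3, 4, 5}.
V_GRID = [0, 0.25, 0.5, 0.75, 1]
S_GRID = [2, 3, 4, 5]

def sweep(evaluate):
    """Return the (S, v) pair that maximizes the evaluation score."""
    return max(product(S_GRID, V_GRID), key=lambda sv: evaluate(*sv))

# Toy scorer (purely illustrative) whose optimum is S=5, v=0.25,
# matching the setting the paper reports for Qwen2.5-7B-Instruct.
toy = lambda S, v: S - abs(v - 0.25)

print(sweep(toy))  # → (5, 0.25)
```

In practice each `evaluate` call would mean re-running the agent over all users and seeds, so the 20-point grid here is the entire cost of the sweep.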