Aligning LLMs by Predicting Preferences from User Writing Samples
Authors: Stéphane Aroca-Ouellette, Natalie Mackraz, Barry-John Theobald, Katherine Metcalf
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate PROSE with several LLMs (i.e., Qwen2.5 7B and 72B Instruct, GPT-4o-mini, and GPT-4o) on a summarization and an email writing task. We find that PROSE more accurately infers nuanced human preferences, improving the quality of the writing agent's generations over CIPHER (a state-of-the-art method for inferring preferences) by 33%. Lastly, we demonstrate that ICL and PROSE are complementary methods, and combining them provides up to a 9% improvement over ICL alone. |
| Researcher Affiliation | Collaboration | Stéphane Aroca-Ouellette¹ ², Natalie Mackraz³, Barry-John Theobald³, Katherine Metcalf³. ¹Work done during internship. ²Department of Computer Science, University of Colorado, Boulder, CO, USA. ³Apple, Cupertino, CA, USA. |
| Pseudocode | Yes | The algorithm is provided in Appendix A, and the complete prompts are in Figure 8 (Appendix F.1). ... Algorithm 1 Assistant Task Completion ... Algorithm 2 PROSE: Preference Reasoning by Observing and Synthesizing Examples |
| Open Source Code | Yes | Code: https://github.com/apple/ml-predict |
| Open Datasets | Yes | We evaluate PROSE on PRELUDE (Gao et al., 2024), the assistive writing benchmark accompanying CIPHER, and identify several limitations. |
| Dataset Splits | Yes | The agent aligns itself with four (email) or five (summarization) users with five demonstrations per user. Performance is evaluated per task as the mean across all demonstrations, users, and task types. Each task is run over five seeds (standard error is reported over the seeds). |
| Hardware Specification | No | The paper mentions LLMs (Qwen2.5-7B-Instruct, Qwen2.5-72B-Instruct, GPT-4o-mini, and GPT-4o) used in the experiments and GPT-4o as a synthetic human, but no specific hardware (GPU, CPU, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using specific LLM models (Qwen2.5-7B-Instruct, Qwen2.5-72B-Instruct, GPT-4o-mini, and GPT-4o) but does not provide any specific software dependencies like programming languages, libraries, or frameworks with their version numbers. |
| Experiment Setup | Yes | For all LLMs, S and v are determined via a hyperparameter sweep over v ∈ {0, 0.25, 0.5, 0.75, 1} and S ∈ {2, 3, 4, 5}. In our experiments S = 5 for all LLMs, and v = 0.25 for Qwen2.5-7B-Instruct, v = 0.5 for both GPT-4o models, and v = 0.75 for Qwen2.5-72B-Instruct. |
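The evaluation protocol above reports the mean over five seeds with the standard error across seeds. A minimal sketch of that aggregation (the function name and input shape are assumptions; the paper does not show this code):

```python
import statistics

def mean_and_stderr(seed_scores):
    """Aggregate per-seed scores: mean and standard error over seeds.

    `seed_scores` is one score per random seed (e.g., five values for
    the five seeds described in the evaluation protocol).
    """
    mean = statistics.mean(seed_scores)
    # Standard error = sample standard deviation / sqrt(number of seeds).
    stderr = statistics.stdev(seed_scores) / len(seed_scores) ** 0.5
    return mean, stderr
```

For example, `mean_and_stderr([1, 2, 3, 4, 5])` returns a mean of 3.0 with a standard error of about 0.707.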
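The hyperparameter selection described in the Experiment Setup row is an exhaustive grid sweep over S and v. A hedged sketch of such a sweep, assuming a caller-supplied scoring function (the sweep logic and `evaluate` interface are assumptions, not the paper's code; the grids and selected values are taken from the quote above):

```python
from itertools import product

# Candidate grids, per the paper's reported sweep.
S_GRID = [2, 3, 4, 5]
V_GRID = [0.0, 0.25, 0.5, 0.75, 1.0]

def grid_sweep(evaluate):
    """Return the (S, v) pair maximizing evaluate(S, v) over the grids."""
    return max(product(S_GRID, V_GRID), key=lambda sv: evaluate(*sv))

# Values the paper reports selecting for each LLM: (S, v).
SELECTED = {
    "Qwen2.5-7B-Instruct": (5, 0.25),
    "Qwen2.5-72B-Instruct": (5, 0.75),
    "GPT-4o-mini": (5, 0.5),
    "GPT-4o": (5, 0.5),
}
```

With a toy objective such as `lambda S, v: S + v`, `grid_sweep` returns `(5, 1.0)`, the corner of the grid.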