Eliciting Human Preferences with Language Models
Authors: Belinda Li, Alex Tamkin, Noah Goodman, Jacob Andreas
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In preregistered experiments, we show that LMs that learn to perform these tasks using GATE (by interactively querying users with open-ended questions) obtain preference specifications that are more informative than user-written prompts or examples. GATE matches existing task specification methods in the moral reasoning task, and significantly outperforms them in the content recommendation and email validation tasks. |
| Researcher Affiliation | Collaboration | Belinda Z. Li MIT CSAIL EMAIL Alex Tamkin Anthropic EMAIL Noah D. Goodman Stanford EMAIL Jacob Andreas MIT CSAIL EMAIL |
| Pseudocode | No | The paper describes the GATE framework and various methods but does not include any explicitly labeled pseudocode or algorithm blocks. The methods are explained in natural language and through diagrams like Figure 1 and Figure 2. |
| Open Source Code | Yes | Code is available at https://github.com/alextamkin/generative-elicitation |
| Open Datasets | Yes | We use the Microsoft News Dataset (Wu et al., 2020) as our pool for this domain, a dataset of 160k news articles with descriptions. The license terms for research use of this dataset can be found at https://github.com/msnews/MIND/blob/master/MSRLicense_Data.pdf. |
| Dataset Splits | No | The paper describes the composition and size of the test sets used for evaluation in each domain (e.g., "16 popular online newspaper and magazine articles as test cases", "28 moral scenarios", "set of 54 test cases"). However, it does not provide specific training/validation/test splits for any dataset in the context of model training, as the models used (GPT-4, Mixtral) are pre-trained LMs and the focus is on preference elicitation rather than traditional model training with fixed splits. |
| Hardware Specification | No | The paper uses GPT-4 and Mixtral models but does not specify the hardware (e.g., GPU models, CPU types, or cloud computing instances with detailed specifications) on which these models were run for the experiments. |
| Software Dependencies | Yes | We use the GPT-4 model (gpt-4-0613 snapshot; OpenAI, 2023) to both elicit user preferences (as an elicitation policy E) and make predictions based on the elicited preferences (as a predictor f̂(s)). We additionally run experiments on Mixtral, an open-source LM, in Appendix C.4. |
| Experiment Setup | Yes | We allow each human user (H in Eq. (2)) to interact open-endedly with an elicitation policy E for five minutes, resulting in a specification s. In all cases, we queried GPT-4 (or Mixtral) with temperature 0 for replicability of experiments. |
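The experiment-setup row above (GPT-4, gpt-4-0613 snapshot, queried at temperature 0 for replicability) can be illustrated with a minimal sketch of how such a deterministic elicitation request might be constructed. This is a hypothetical illustration, not the authors' released code: the function name `build_elicitation_request`, the system prompt, and the dialogue format are all assumptions.

```python
# Hypothetical sketch of the elicitation setup reported in the table:
# GPT-4 (gpt-4-0613) queried at temperature 0. All names here are
# illustrative, not taken from the paper's repository.

def build_elicitation_request(dialogue_history, domain="content recommendation"):
    """Build a deterministic (temperature-0) chat request asking the LM
    to pose the next open-ended preference-elicitation question."""
    system_prompt = (
        f"You are eliciting a user's preferences for {domain}. "
        "Ask one open-ended question at a time."
    )
    messages = [{"role": "system", "content": system_prompt}]
    for speaker, text in dialogue_history:
        # The elicitation policy's turns map to the assistant role;
        # the human user's turns map to the user role.
        role = "assistant" if speaker == "elicitor" else "user"
        messages.append({"role": role, "content": text})
    return {
        "model": "gpt-4-0613",   # snapshot reported in the paper
        "temperature": 0,        # temperature 0 for replicability
        "messages": messages,
    }

request = build_elicitation_request(
    [("elicitor", "What topics do you enjoy reading about?"),
     ("user", "Mostly science and technology news.")]
)
```

The returned dictionary matches the payload shape of a chat-completion API call; pinning both the model snapshot and temperature 0, as the paper reports, is what makes repeated runs comparable.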