Evaluating Human-Language Model Interaction

Authors: Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, Rose E Wang, Minae Kwon, Joon Sung Park, Hancheng Cao, Tony Lee, Rishi Bommasani, Michael S. Bernstein, Percy Liang

TMLR 2023

Reproducibility assessment (Variable / Result / supporting excerpt from the paper):
Research Type: Experimental
  Evidence: "We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation." "From the 1015 interaction traces we collected (in English), we observe that better non-interactive performance does not always lead to better human-LM interaction."
Researcher Affiliation: Academia
  Evidence: Stanford University; Imperial College London
Pseudocode: Yes
  Evidence (Algorithm 1: Generate an interaction trace):
    s_0 <- task-specific contents
    for t = 1, 2, ... do
      User takes an action a_t
      if a_t finishes the interaction then break
      System updates the state s_{t+1} <- Transition(s_t, a_t)
    end
    return [(s_1, a_1), (s_2, a_2), ...]
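The interaction-trace loop in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the state representation, the `transition` function, and the way user actions arrive (here, a pre-recorded list with a literal "finish" action) are all assumptions.

```python
# Minimal sketch of Algorithm 1 (generating an interaction trace).
# State, transition, and user-action representations are assumptions
# for illustration only.

def generate_trace(initial_state, user_actions, transition):
    """Run the interaction loop; return the list of (state, action) pairs."""
    trace = []
    state = initial_state  # s_0: task-specific contents
    for action in user_actions:  # a_1, a_2, ...
        trace.append((state, action))  # record (s_t, a_t)
        if action == "finish":  # user ends the interaction
            break
        state = transition(state, action)  # s_{t+1} <- Transition(s_t, a_t)
    return trace

# Toy usage: the state is a text buffer and each "type:..." action
# appends a word to it.
trace = generate_trace(
    "doc:",
    ["type:hello", "type:world", "finish"],
    lambda s, a: s + " " + a.split(":", 1)[1],
)
```

The replayable traces released by the authors record exactly this kind of (state, action) sequence, which is what makes post-hoc analysis of the interaction possible.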
Open Source Code: Yes
  Evidence: "We release our interaction traces, their replay links, and system interfaces at https://github.com/stanford-crfm/halie."
Open Datasets: Yes
  Evidence: "We randomly select ten scenarios from the Empathetic Dialogues (Rashkin et al., 2019) and Commonsense Dialogues (Zhou et al., 2021) datasets after manually filtering out potentially sensitive and upsetting scenarios from the validation and test sets of the datasets." "We use questions from the Measuring Massive Multitask Language Understanding (MMLU) dataset (Hendrycks et al., 2021)." "We randomly select 964 documents from the XSum dataset (Narayan et al., 2018)."
Dataset Splits: Yes
  Evidence: "Concretely, we first chose five diverse subjects from the dataset (Global facts, Nutrition, US foreign policy, College chemistry, and Miscellany), and selected 6 questions from each subject to construct a pool of 30 questions. We constructed quizzes by randomly selecting ten questions from the pool and adding one attention check question in the middle." "We randomly select 964 documents from the XSum dataset (Narayan et al., 2018) and construct 20 summarization sessions by randomly choosing ten documents per session without replacement."
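The quiz-construction procedure (sample ten questions without replacement from a 30-question pool, insert one attention check in the middle) is straightforward to sketch. The pool contents and the `ATTENTION_CHECK` placeholder below are illustrative assumptions, not the paper's actual question set.

```python
import random

# Sketch of the quiz construction: 10 questions sampled without
# replacement from a 30-question pool (5 subjects x 6 questions),
# with one attention-check question inserted in the middle.
# Question identifiers are placeholders.

def build_quiz(pool, rng, n_questions=10):
    quiz = rng.sample(pool, n_questions)  # sample without replacement
    quiz.insert(n_questions // 2, "ATTENTION_CHECK")  # middle of the quiz
    return quiz

pool = [f"Q{subject}-{i}" for subject in range(5) for i in range(6)]
quiz = build_quiz(pool, random.Random(0))  # 11 items: 10 questions + 1 check
```

Sampling without replacement within a quiz keeps the ten questions distinct, while independent draws across quizzes let the same pool question appear in multiple sessions.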
Hardware Specification: No
  Evidence: The paper mentions typical latency for the LMs used (e.g., 0.12 s for Text Davinci) and discusses "allocated compute resources" in general terms, but does not provide specific hardware models (e.g., GPU/CPU types, memory) used for the experiments.
Software Dependencies: Yes
  Evidence: "For these calculations, we used both the Python scipy and R stats packages (Virtanen et al., 2020; R Core Team, 2020)."
Experiment Setup: Yes
  Evidence: "In the dialogue system, Create Prompt creates a prompt by concatenating four example dialogues for in-context learning and the current dialogue history. While doing so, we omit the scenario information, as the scenario is only known to the user. For Query LM, we use top_k = 50 and temperature = 0.9 as decoding parameters and use HTML tags to delineate a conversation and the turns within." "In the QA system, Create Prompt creates a prompt by simply copying and pasting user input from the interface. Note that we do not include a multiple-choice question as part of the prompt. For Query LM, we use temperature = 0.5 and max_tokens = 100 as decoding parameters."
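The dialogue system's Create Prompt step (four in-context example dialogues concatenated with the current history, delineated by HTML tags, with the scenario omitted) can be sketched as below. The specific tag names (`<dialogue>`, `<turn>`) and function names are assumptions for illustration; the paper reports only that HTML tags are used, not which ones.

```python
# Sketch of the dialogue system's Create Prompt step. Tag names and
# the data layout (list of (speaker, text) turns) are assumptions.

def format_dialogue(turns):
    """Wrap a list of (speaker, text) turns in assumed HTML-style tags."""
    body = "".join(f"<turn>{speaker}: {text}</turn>" for speaker, text in turns)
    return f"<dialogue>{body}</dialogue>"

def create_prompt(example_dialogues, history):
    # Scenario information is omitted: only the user knows the scenario.
    parts = [format_dialogue(d) for d in example_dialogues[:4]]
    parts.append(format_dialogue(history))  # current dialogue history last
    return "\n".join(parts)

examples = [[("User", "Hi"), ("Bot", "Hello!")]] * 4
history = [("User", "How are you?")]
prompt = create_prompt(examples, history)
# The LM would then be queried on this prompt with the reported decoding
# parameters (top_k = 50, temperature = 0.9).
```

The QA system's prompt construction is simpler still: the user's free-form input is passed through verbatim, with no multiple-choice options included.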