Evaluating Human-Language Model Interaction

Authors: Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, Rose E Wang, Minae Kwon, Joon Sung Park, Hancheng Cao, Tony Lee, Rishi Bommasani, Michael S. Bernstein, Percy Liang

TMLR 2023

Reproducibility assessment (Variable / Result / supporting excerpt from the paper):
Research Type: Experimental
  Evidence: "We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation." "From the 1015 interaction traces we collected (in English), we observe that better non-interactive performance does not always lead to better human-LM interaction."
Researcher Affiliation: Academia
  Evidence: Stanford University; Imperial College London
Pseudocode: Yes
  Evidence (Algorithm 1: Generate an interaction trace):
    s_0 <- task-specific contents
    for t = 1, 2, ... do
      User takes an action a_t
      if a_t finishes the interaction then break
      System updates the state s_{t+1} <- Transition(s_t, a_t)
    end
    return [(s_1, a_1), (s_2, a_2), ...]
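The interaction-trace loop in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the state representation, the `transition` function, and the way user actions arrive (here, a pre-recorded list with a literal "finish" action) are all assumptions.

```python
# Minimal sketch of Algorithm 1 (generating an interaction trace).
# State, transition, and user-action representations are assumptions
# for illustration only.

def generate_trace(initial_state, user_actions, transition):
    """Run the interaction loop; return the list of (state, action) pairs."""
    trace = []
    state = initial_state  # s_0: task-specific contents
    for action in user_actions:  # a_1, a_2, ...
        trace.append((state, action))  # record (s_t, a_t)
        if action == "finish":  # user ends the interaction
            break
        state = transition(state, action)  # s_{t+1} <- Transition(s_t, a_t)
    return trace

# Toy usage: the state is a text buffer and each "type:..." action
# appends a word to it.
trace = generate_trace(
    "doc:",
    ["type:hello", "type:world", "finish"],
    lambda s, a: s + " " + a.split(":", 1)[1],
)
```

The replayable traces released by the authors record exactly this kind of (state, action) sequence, which is what makes post-hoc analysis of the interaction possible.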
Open Source Code: Yes
  Evidence: "We release our interaction traces, their replay links, and system interfaces at https://github.com/stanford-crfm/halie."
Open Datasets: Yes
  Evidence: "We randomly select ten scenarios from the Empathetic Dialogues (Rashkin et al., 2019) and Commonsense Dialogues (Zhou et al., 2021) datasets after manually filtering out potentially sensitive and upsetting scenarios from the validation and test sets of the datasets." "We use questions from the Measuring Massive Multitask Language Understanding (MMLU) dataset (Hendrycks et al., 2021)." "We randomly select 964 documents from the XSum dataset (Narayan et al., 2018)."
Dataset Splits: Yes
  Evidence: "Concretely, we first chose five diverse subjects from the dataset (Global facts, Nutrition, US foreign policy, College chemistry, and Miscellany), and selected 6 questions from each subject to construct a pool of 30 questions. We constructed quizzes by randomly selecting ten questions from the pool and adding one attention check question in the middle." "We randomly select 964 documents from the XSum dataset (Narayan et al., 2018) and construct 20 summarization sessions by randomly choosing ten documents per session without replacement."
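The quiz-construction procedure (sample ten questions without replacement from a 30-question pool, insert one attention check in the middle) is straightforward to sketch. The pool contents and the `ATTENTION_CHECK` placeholder below are illustrative assumptions, not the paper's actual question set.

```python
import random

# Sketch of the quiz construction: 10 questions sampled without
# replacement from a 30-question pool (5 subjects x 6 questions),
# with one attention-check question inserted in the middle.
# Question identifiers are placeholders.

def build_quiz(pool, rng, n_questions=10):
    quiz = rng.sample(pool, n_questions)  # sample without replacement
    quiz.insert(n_questions // 2, "ATTENTION_CHECK")  # middle of the quiz
    return quiz

pool = [f"Q{subject}-{i}" for subject in range(5) for i in range(6)]
quiz = build_quiz(pool, random.Random(0))  # 11 items: 10 questions + 1 check
```

Sampling without replacement within a quiz keeps the ten questions distinct, while independent draws across quizzes let the same pool question appear in multiple sessions.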
Hardware Specification: No
  Evidence: The paper mentions typical latency for the LMs used (e.g., 0.12 s for Text Davinci) and discusses "allocated compute resources" in general terms, but does not provide specific hardware models (e.g., GPU/CPU types, memory) used for the experiments.
Software Dependencies: Yes
  Evidence: "For these calculations, we used both the Python scipy and R stats packages (Virtanen et al., 2020; R Core Team, 2020)."
Experiment Setup: Yes
  Evidence: "In the dialogue system, Create Prompt creates a prompt by concatenating four example dialogues for in-context learning and the current dialogue history. While doing so, we omit the scenario information, as the scenario is only known to the user. For Query LM, we use top_k = 50 and temperature = 0.9 as decoding parameters and use HTML tags to delineate a conversation and the turns within." "In the QA system, Create Prompt creates a prompt by simply copying and pasting user input from the interface. Note that we do not include a multiple-choice question as part of the prompt. For Query LM, we use temperature = 0.5 and max_tokens = 100 as decoding parameters."
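The dialogue system's Create Prompt step (four in-context example dialogues concatenated with the current history, delineated by HTML tags, with the scenario omitted) can be sketched as below. The specific tag names (`<dialogue>`, `<turn>`) and function names are assumptions for illustration; the paper reports only that HTML tags are used, not which ones.

```python
# Sketch of the dialogue system's Create Prompt step. Tag names and
# the data layout (list of (speaker, text) turns) are assumptions.

def format_dialogue(turns):
    """Wrap a list of (speaker, text) turns in assumed HTML-style tags."""
    body = "".join(f"<turn>{speaker}: {text}</turn>" for speaker, text in turns)
    return f"<dialogue>{body}</dialogue>"

def create_prompt(example_dialogues, history):
    # Scenario information is omitted: only the user knows the scenario.
    parts = [format_dialogue(d) for d in example_dialogues[:4]]
    parts.append(format_dialogue(history))  # current dialogue history last
    return "\n".join(parts)

examples = [[("User", "Hi"), ("Bot", "Hello!")]] * 4
history = [("User", "How are you?")]
prompt = create_prompt(examples, history)
# The LM would then be queried on this prompt with the reported decoding
# parameters (top_k = 50, temperature = 0.9).
```

The QA system's prompt construction is simpler still: the user's free-form input is passed through verbatim, with no multiple-choice options included.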