The RealHumanEval: Evaluating Large Language Models’ Abilities to Support Programmers
Authors: Hussein Mozannar, Valerie Chen, Mohammed Alsobay, Subhro Das, Sebastian Zhao, Dennis Wei, Manish Nagireddy, Prasanna Sattigeri, Ameet Talwalkar, David Sontag
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted a user study (N=243) using RealHumanEval in which users interacted with seven LLMs of varying base model performance. |
| Researcher Affiliation | Collaboration | Hussein Mozannar* (Microsoft Research); Valerie Chen* (Carnegie Mellon University); Mohammed Alsobay (Massachusetts Institute of Technology); Subhro Das (MIT-IBM Watson AI Lab, IBM Research); Sebastian Zhao (University of California, Berkeley); Dennis Wei (MIT-IBM Watson AI Lab, IBM Research); Manish Nagireddy (MIT-IBM Watson AI Lab, IBM Research); Prasanna Sattigeri (MIT-IBM Watson AI Lab, IBM Research); Ameet Talwalkar (Carnegie Mellon University); David Sontag (Massachusetts Institute of Technology) |
| Pseudocode | No | The paper describes methods in narrative text and uses code examples within figures and appendices, but does not present any structured pseudocode or algorithm blocks with explicit labels like "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | We open-source RealHumanEval to enable human-centric evaluation of new models and the study data to facilitate efforts to improve code models. |
| Open Datasets | Yes | We open-source RealHumanEval to enable human-centric evaluation of new models and the study data to facilitate efforts to improve code models. ... We release the dataset of interactions collected from this study to guide the development of better coding assistants. |
| Dataset Splits | No | We designed 17 coding tasks for the platform that can be categorized into three categories... These 17 tasks are distributed into five sets, where each set consists of a different mix of task types in varying orders but shares the first two tasks. Each participant is randomly assigned to one of these sets. While this describes the distribution of tasks for the user study, it does not constitute traditional train/test/validation dataset splits used for model training or evaluation as typically described in ML papers. |
| Hardware Specification | No | The paper states that RealHumanEval eliminates "the need for participants to perform any additional installation of a bespoke IDE or study-specific extension or to have access to special hardware to serve study-specific models." It also mentions that the interface "supports any LLM invoked via an online API." The authors therefore rely on external LLM services, and no hardware details for running the experiments or hosting the platform are provided. |
| Software Dependencies | No | The paper mentions that participants will be "writing code in Python only and use only standard python libraries and only numpy and pandas." This describes the participants' task environment, not versioned software dependencies for the RealHumanEval platform or the authors' analysis; no library versions are provided. |
| Experiment Setup | Yes | Each participant was assigned to one of seven conditions: a control condition with no LLM support, three conditions with autocomplete support from either Code Llama-7b (Rozière et al., 2023), Code Llama-34b (Rozière et al., 2023), or GPT-3.5-turbo-instruct (Brown et al., 2020), and finally three conditions where the editor is equipped with a chat window powered by the chat variants of the previous models in addition to GPT-4o (OpenAI, 2022). ... Participants are given 35 minutes to complete as many tasks as possible. If 10 minutes pass and the participant has not completed the task, a button appears to provide the option to skip the task. ... For all LLMs, we used a temperature setting of 1 to ensure varied responses. For autocomplete LLMs, ... a token length of 64 made the suggestions more likely to be correct while not being too short. ... suggestion length random (truncated Gaussian) on the interval [10,120] with mean 64 ... For the chat LLMs, we set the max_token parameter to 512 tokens constant. |
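The sampling scheme quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the paper states only the interval [10, 120] and mean 64 for the truncated Gaussian, so the standard deviation and the rejection-sampling approach here are assumptions.

```python
import numpy as np

# Settings quoted from the paper's experiment setup.
TEMPERATURE = 1.0       # used for all LLMs
CHAT_MAX_TOKENS = 512   # constant max_token for chat LLMs

def sample_suggestion_length(rng, low=10, high=120, mean=64, std=25):
    """Draw an autocomplete suggestion length (in tokens) from a Gaussian
    truncated to [low, high]. std=25 is an assumption; the paper does not
    report the standard deviation."""
    while True:
        # Rejection sampling: redraw until the sample falls in the interval.
        x = rng.normal(mean, std)
        if low <= x <= high:
            return int(round(x))

rng = np.random.default_rng(0)
lengths = [sample_suggestion_length(rng) for _ in range(1000)]
```

Every sampled length lies in [10, 120], and the empirical mean stays near 64 because the interval is roughly symmetric about it.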