The RealHumanEval: Evaluating Large Language Models’ Abilities to Support Programmers
Authors: Hussein Mozannar, Valerie Chen, Mohammed Alsobay, Subhro Das, Sebastian Zhao, Dennis Wei, Manish Nagireddy, Prasanna Sattigeri, Ameet Talwalkar, David Sontag
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted a user study (N=243) using RealHumanEval in which users interacted with seven LLMs of varying base model performance. |
| Researcher Affiliation | Collaboration | Hussein Mozannar* (Microsoft Research); Valerie Chen* (Carnegie Mellon University); Mohammed Alsobay (Massachusetts Institute of Technology); Subhro Das (MIT-IBM Watson AI Lab, IBM Research); Sebastian Zhao (University of California, Berkeley); Dennis Wei (MIT-IBM Watson AI Lab, IBM Research); Manish Nagireddy (MIT-IBM Watson AI Lab, IBM Research); Prasanna Sattigeri (MIT-IBM Watson AI Lab, IBM Research); Ameet Talwalkar (Carnegie Mellon University); David Sontag (Massachusetts Institute of Technology) |
| Pseudocode | No | The paper describes methods in narrative text and uses code examples within figures and appendices, but does not present any structured pseudocode or algorithm blocks with explicit labels like "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | We open-source RealHumanEval to enable human-centric evaluation of new models and the study data to facilitate efforts to improve code models. |
| Open Datasets | Yes | We open-source RealHumanEval to enable human-centric evaluation of new models and the study data to facilitate efforts to improve code models. ... We release the dataset of interactions collected from this study to guide the development of better coding assistants. |
| Dataset Splits | No | We designed 17 coding tasks for the platform that can be categorized into three categories... These 17 tasks are distributed into five sets, where each set consists of a different mix of task types in varying orders but shares the first two tasks. Each participant is randomly assigned to one of these sets. While this describes the distribution of tasks for the user study, it does not constitute traditional train/test/validation dataset splits used for model training or evaluation as typically described in ML papers. |
| Hardware Specification | No | The paper states that RealHumanEval eliminates "the need for participants to perform any additional installation of a bespoke IDE or study-specific extension or to have access to special hardware to serve study-specific models." It also mentions that the interface "supports any LLM invoked via an online API." The authors therefore rely on external LLM services, and no hardware details for running the experiments or hosting the platform are provided. |
| Software Dependencies | No | The paper mentions that participants will be "writing code in Python only and use only standard python libraries and only numpy and pandas." This describes the participants' task environment, not versioned software dependencies for the RealHumanEval platform or the authors' analysis; no library versions are provided. |
| Experiment Setup | Yes | Each participant was assigned to one of seven conditions: a control condition with no LLM support, three conditions with autocomplete support from either Code Llama-7b (Rozière et al., 2023), Code Llama-34b (Rozière et al., 2023), or GPT-3.5-turbo-instruct (Brown et al., 2020), and finally three conditions where the editor is equipped with a chat window powered by the chat variants of the previous models in addition to GPT-4o (OpenAI, 2022). ... Participants are given 35 minutes to complete as many tasks as possible. If 10 minutes pass and the participant has not completed the task, a button appears to provide the option to skip the task. ... For all LLMs, we used a temperature setting of 1 to ensure varied responses. For autocomplete LLMs, ... a token length of 64 made the suggestions more likely to be correct while not being too short. ... suggestion length random (truncated Gaussian) on the interval [10,120] with mean 64 ... For the chat LLMs, we set the max_token parameter to 512 tokens constant. |
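The sampling scheme quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the paper states only the interval [10, 120] and mean 64 for the truncated Gaussian, so the standard deviation and the rejection-sampling approach here are assumptions.

```python
import numpy as np

# Settings quoted from the paper's experiment setup.
TEMPERATURE = 1.0       # used for all LLMs
CHAT_MAX_TOKENS = 512   # constant max_token for chat LLMs

def sample_suggestion_length(rng, low=10, high=120, mean=64, std=25):
    """Draw an autocomplete suggestion length (in tokens) from a Gaussian
    truncated to [low, high]. std=25 is an assumption; the paper does not
    report the standard deviation."""
    while True:
        # Rejection sampling: redraw until the sample falls in the interval.
        x = rng.normal(mean, std)
        if low <= x <= high:
            return int(round(x))

rng = np.random.default_rng(0)
lengths = [sample_suggestion_length(rng) for _ in range(1000)]
```

Every sampled length lies in [10, 120], and the empirical mean stays near 64 because the interval is roughly symmetric about it.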