Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge

Authors: Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Nicholas J Pangakis, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z. Jacobs

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this position paper, we argue that the ML community would benefit from learning from and drawing on the social sciences when developing and using measurement instruments for evaluating Gen AI systems. Specifically, our position is that evaluating Gen AI systems is a social science measurement challenge. We present a four-level framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, behaviors, and impacts of Gen AI systems.
Researcher Affiliation | Collaboration | Hanna Wallach¹, Meera Desai², A. Feder Cooper¹, Angelina Wang³, Chad Atalla¹, Solon Barocas¹, Su Lin Blodgett¹, Alexandra Chouldechova¹, Emily Corvi¹, P. Alex Dow¹, Jean Garcia-Gathright¹, Alexandra Olteanu¹, Nicholas Pangakis¹, Stefanie Reed¹, Emily Sheng¹, Dan Vann¹, Jennifer Wortman Vaughan¹, Matthew Vogel¹, Hannah Washington¹, Abigail Z. Jacobs². ¹Microsoft Research; ²University of Michigan; ³Stanford University. Correspondence to: Hanna Wallach <EMAIL>.
Pseudocode | No | The paper presents a conceptual framework and describes its processes (systematization, operationalization, application, interrogation) in prose and a diagram (Figure 1), but it contains no structured pseudocode or algorithm blocks.
Open Source Code | No | The paper is a position paper proposing a framework for evaluating Gen AI systems; it does not present new methodology that would require a source code release.
Open Datasets | No | The paper is theoretical and does not conduct experiments on datasets. It refers to existing benchmarks and datasets in hypothetical examples (e.g., "International Math Olympiad problems" as a measurement instrument for a hypothetical task), but the authors do not use these for their own experimental work in this paper.
Dataset Splits | No | The paper is theoretical and does not conduct experiments; therefore, it provides no dataset split information for reproduction.
Hardware Specification | No | The paper is theoretical and does not conduct experiments, so no hardware specifications are provided.
Software Dependencies | No | The paper is theoretical and does not conduct experiments; therefore, it lists no specific software dependencies with version numbers for reproducing experimental results.
Experiment Setup | No | The paper presents a theoretical framework and describes no experiments; as such, it provides no experimental setup details or hyperparameters.