Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge

Authors: Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Nicholas J Pangakis, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z. Jacobs

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this position paper, we argue that the ML community would benefit from learning from and drawing on the social sciences when developing and using measurement instruments for evaluating Gen AI systems. Specifically, our position is that evaluating Gen AI systems is a social science measurement challenge. We present a four-level framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, behaviors, and impacts of Gen AI systems.
Researcher Affiliation | Collaboration | Hanna Wallach¹, Meera Desai², A. Feder Cooper¹, Angelina Wang³, Chad Atalla¹, Solon Barocas¹, Su Lin Blodgett¹, Alexandra Chouldechova¹, Emily Corvi¹, P. Alex Dow¹, Jean Garcia-Gathright¹, Alexandra Olteanu¹, Nicholas Pangakis¹, Stefanie Reed¹, Emily Sheng¹, Dan Vann¹, Jennifer Wortman Vaughan¹, Matthew Vogel¹, Hannah Washington¹, Abigail Z. Jacobs². ¹Microsoft Research; ²University of Michigan; ³Stanford University. Correspondence to: Hanna Wallach <EMAIL>.
Pseudocode | No | The paper presents a conceptual framework and describes its processes (systematization, operationalization, application, interrogation) in prose and a diagram (Figure 1), but it contains no structured pseudocode or algorithm blocks.
Open Source Code | No | The paper is a position paper proposing a framework for evaluating Gen AI systems; it does not present new methodology that would require a source code release.
Open Datasets | No | The paper is theoretical and does not conduct experiments on datasets. It refers to existing benchmarks and datasets in hypothetical examples (e.g., "International Math Olympiad problems" as a measurement instrument for a hypothetical task), but the authors do not use these for their own experimental work in this paper.
Dataset Splits | No | The paper is theoretical and does not conduct experiments; therefore, it provides no dataset split information for reproduction.
Hardware Specification | No | The paper is theoretical and does not conduct experiments, so no hardware specifications are provided.
Software Dependencies | No | The paper is theoretical and does not conduct experiments; therefore, it lists no specific software dependencies with version numbers for reproducing experimental results.
Experiment Setup | No | The paper presents a theoretical framework and describes no experiments; as such, it provides no experimental setup details or hyperparameters.