Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge
Authors: Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Nicholas J Pangakis, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z. Jacobs
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this position paper, we argue that the ML community would benefit from learning from and drawing on the social sciences when developing and using measurement instruments for evaluating Gen AI systems. Specifically, our position is that evaluating Gen AI systems is a social science measurement challenge. We present a four-level framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, behaviors, and impacts of Gen AI systems. |
| Researcher Affiliation | Collaboration | Hanna Wallach (1), Meera Desai (2), A. Feder Cooper (1), Angelina Wang (3), Chad Atalla (1), Solon Barocas (1), Su Lin Blodgett (1), Alexandra Chouldechova (1), Emily Corvi (1), P. Alex Dow (1), Jean Garcia-Gathright (1), Alexandra Olteanu (1), Nicholas Pangakis (1), Stefanie Reed (1), Emily Sheng (1), Dan Vann (1), Jennifer Wortman Vaughan (1), Matthew Vogel (1), Hannah Washington (1), Abigail Z. Jacobs (2). (1) Microsoft Research, (2) University of Michigan, (3) Stanford University. Correspondence to: Hanna Wallach <EMAIL>. |
| Pseudocode | No | The paper presents a conceptual framework and discusses processes (systematization, operationalization, application, interrogation) in descriptive text and a diagram (Figure 1), but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper is a position paper proposing a framework for evaluating Gen AI systems. It presents no implementation or new method for which source code could be released. |
| Open Datasets | No | The paper is theoretical and does not conduct experiments using datasets. It refers to existing benchmarks and datasets in hypothetical examples (e.g., "International Math Olympiad problems" as a measurement instrument for a hypothetical task), but these are not used by the authors for their own experimental work in this paper. |
| Dataset Splits | No | The paper is theoretical and does not conduct experiments; therefore, it does not provide dataset split information for reproduction. |
| Hardware Specification | No | The paper is theoretical and does not conduct experiments, thus no hardware specifications are provided. |
| Software Dependencies | No | The paper is theoretical and does not conduct experiments. Therefore, it does not list specific software dependencies with version numbers for reproducing experimental results. |
| Experiment Setup | No | The paper presents a theoretical framework and does not describe any experiments. As such, it does not provide specific experimental setup details or hyperparameters. |