Undesirable Biases in NLP: Addressing Challenges of Measurement

Authors: Oskar van der Wal, Dominik Bachmann, Alina Leidinger, Leendert van Maanen, Willem Zuidema, Katrin Schulz

JAIR 2024

Reproducibility assessment (Variable: Result, followed by the LLM response):
Research Type: Theoretical. In this paper, we provide an interdisciplinary approach to discussing the issue of NLP model bias by adopting the lens of psychometrics, a field specialized in the measurement of concepts like bias that are not directly observable. In particular, we will explore two central notions from psychometrics, the construct validity and the reliability of measurement tools, and discuss how they can be applied in the context of measuring model bias. Our goal is to provide NLP practitioners with methodological tools for designing better bias measures, and to inspire them more generally to explore tools from psychometrics when working on bias measurement tools.
Researcher Affiliation: Academia.
- Oskar van der Wal (EMAIL), Institute for Logic, Language and Computation, University of Amsterdam
- Dominik Bachmann (EMAIL), Institute for Logic, Language and Computation, University of Amsterdam; Department of Experimental Psychology, Utrecht University
- Alina Leidinger (EMAIL), Institute for Logic, Language and Computation, University of Amsterdam
- Leendert van Maanen (EMAIL), Department of Experimental Psychology, Utrecht University
- Willem Zuidema (EMAIL)
- Katrin Schulz (EMAIL), Institute for Logic, Language and Computation, University of Amsterdam
Pseudocode: No. The paper discusses psychometric concepts and their application to NLP bias measurement, but it does not present any structured pseudocode or algorithm blocks.
Open Source Code: No. The paper presents a theoretical framework and discussion for evaluating bias measures in NLP and does not describe a novel computational methodology for which source code would be released.
Open Datasets: No. The paper discusses various existing bias measures and their use of benchmark datasets (e.g., CrowS-Pairs, STS-B for gender, WinoBias) as examples within its conceptual framework, but it does not conduct new experiments using a specific dataset or release a new open dataset.
Dataset Splits: No. The paper provides a conceptual framework for evaluating bias measures and does not report on original experimental results involving dataset splits.
Hardware Specification: No. The paper focuses on a theoretical and methodological discussion of bias measurement and does not describe the hardware used for experimental runs.
Software Dependencies: No. The paper is a conceptual work outlining a framework for evaluating bias measures and does not specify software dependencies with version numbers for any implemented methods.
Experiment Setup: No. The paper provides a conceptual framework and discussion regarding bias measurement in NLP and does not detail a specific experimental setup, hyperparameters, or training configurations.