First-Person Fairness in Chatbots
Authors: Tyna Eloundou, Alex Beutel, David Robinson, Keren Gu, Anna-Luisa Brakman, Pamela Mishkin, Meghan Shah, Johannes Heidecke, Lilian Weng, Adam Tauman Kalai
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply this approach to assess biases in six of our language models across millions of interactions, covering sixty-six tasks in nine domains and spanning two genders and four races. Independent human annotations corroborate the LMRA-generated bias evaluations. This study represents the first large-scale fairness evaluation based on real-world chat data. |
| Researcher Affiliation | Industry | OpenAI |
| Pseudocode | Yes | The Bias Enumeration Algorithm (full details in Algorithm 1 of Appendix B) has four steps: |
| Open Source Code | No | While our specific results are not directly reproducible due to privacy, our approach is methodologically replicable, meaning that it can be applied to any name-sensitive chatbot and be used to monitor for bias in deployed systems. Prompts used in this work are provided, and Appendix M gives instructions on how to use the API to simulate ChatGPT behavior with arbitrary Custom Instructions (CI), facilitating future research on chatbot fairness. |
| Open Datasets | Yes | Examples published in this work and shown to crowd workers are drawn from two chat datasets that are open and publicly available: LMSYS (Zheng et al., 2023) and WildChat (Zhao et al., 2024). |
| Dataset Splits | Yes | A stratified sample of 50 response pairs to public prompts was selected to evaluate how well LMRA ratings correlate with human ratings. Our analysis also covers the full distribution of English prompts: the average response quality distribution for the 4o-mini model, as rated by the 4o model, was evaluated on 100k random real chats, including chats that fall outside our hierarchy. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, processor types, or memory amounts) used for running experiments or LMRA evaluations are provided. The paper discusses specific language models but not the underlying hardware. |
| Software Dependencies | No | The paper mentions using 'OpenAI’s API' and 'scikit-learn standard K-means clustering algorithm' but does not specify version numbers for these or any other software dependencies needed to replicate the experiment. |
| Experiment Setup | Yes | All responses were generated with ChatGPT models run at temperature 0.8 (except for the LMRA, which was run at temperature 0). The order of messages is: 1. Model-specific system message... 2. Custom Instruction system message... 3. Prompt, i.e., the user message. |
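The three-part message ordering described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the paper's code: the helper name, the example system message, and the example Custom Instruction are assumptions made for clarity.

```python
# Sketch of the documented message order: (1) model-specific system message,
# (2) Custom Instruction system message, (3) the user prompt.
def build_messages(model_system_msg, custom_instructions, user_prompt):
    """Assemble the three-part message sequence in the order the paper describes."""
    return [
        {"role": "system", "content": model_system_msg},      # model-specific
        {"role": "system", "content": custom_instructions},   # Custom Instructions (CI)
        {"role": "user", "content": user_prompt},             # the user's message
    ]

messages = build_messages(
    "You are ChatGPT, a large language model.",  # illustrative system message
    "My name is Ashley.",                        # illustrative name-bearing CI
    "Suggest a career path for me.",             # illustrative user prompt
)

# Responses in the paper were sampled at temperature 0.8; with the OpenAI
# Python client a call would look like this (not executed here):
# client.chat.completions.create(model="gpt-4o-mini",
#                                messages=messages, temperature=0.8)
```

Placing the name in a separate Custom Instruction system message, rather than in the prompt itself, is what lets the same user prompt be replayed with different CI values for the counterfactual comparison.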
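The Software Dependencies row notes that the paper uses scikit-learn's standard K-means for clustering. A minimal sketch of that step follows; the random embeddings, dimensionality, and cluster count are illustrative assumptions, not the paper's actual data or settings.

```python
# Minimal K-means sketch with scikit-learn's standard implementation.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for text embeddings of chat prompts (e.g. 256-dim vectors).
embeddings = rng.normal(size=(200, 256))

# n_clusters=9 is an illustrative choice; n_init and random_state are pinned
# so the clustering is deterministic across runs.
kmeans = KMeans(n_clusters=9, n_init=10, random_state=0).fit(embeddings)

labels = kmeans.labels_              # cluster assignment for each prompt
centroids = kmeans.cluster_centers_  # one centroid per cluster
```

Pinning `n_init` and `random_state` matters for replicability here: scikit-learn changed the `n_init` default across versions, which is exactly the kind of unversioned dependency the review flags.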