First-Person Fairness in Chatbots
Authors: Tyna Eloundou, Alex Beutel, David Robinson, Keren Gu, Anna-Luisa Brakman, Pamela Mishkin, Meghan Shah, Johannes Heidecke, Lilian Weng, Adam Tauman Kalai
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply this approach to assess biases in six of our language models across millions of interactions, covering sixty-six tasks in nine domains and spanning two genders and four races. Independent human annotations corroborate the LMRA-generated bias evaluations. This study represents the first large-scale fairness evaluation based on real-world chat data. |
| Researcher Affiliation | Industry | OpenAI |
| Pseudocode | Yes | The Bias Enumeration Algorithm (full details in Algorithm 1 of Appendix B) has four steps: |
| Open Source Code | No | While our specific results are not directly reproducible due to privacy, our approach is methodologically replicable, meaning that it can be applied to any name-sensitive chatbot and be used to monitor for bias in deployed systems. Prompts used in this work are provided, and Appendix M gives instructions on how to use the API to simulate ChatGPT behavior with arbitrary Custom Instructions (CI), facilitating future research on chatbot fairness. |
| Open Datasets | Yes | Examples published in this work and shown to crowd workers are drawn from two chat datasets that are open and publicly available: LMSYS (Zheng et al., 2023) and WildChat (Zhao et al., 2024). |
| Dataset Splits | Yes | A stratified sample of 50 response pairs to public prompts was selected to evaluate how well LMRA ratings correlate with human ratings. Our analysis also covers the full distribution of English prompts: the average response quality distribution for the 4o-mini model, as rated by the 4o model, was evaluated on 100k random real chats, including chats that fall outside our hierarchy. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, processor types, or memory amounts) used for running experiments or LMRA evaluations are provided. The paper discusses specific language models but not the underlying hardware. |
| Software Dependencies | No | The paper mentions using 'OpenAI’s API' and 'scikit-learn standard K-means clustering algorithm' but does not specify version numbers for these or any other software dependencies needed to replicate the experiment. |
| Experiment Setup | Yes | All responses were generated with ChatGPT models run at temperature 0.8 (except for the LMRA, which was run at temperature 0). The order of messages is: 1. Model-specific system message... 2. Custom Instruction system message... 3. Prompt, i.e., the user message. |
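The three-part message ordering described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the paper's code: the helper name, the example system message, and the example Custom Instruction are assumptions made for clarity.

```python
# Sketch of the documented message order: (1) model-specific system message,
# (2) Custom Instruction system message, (3) the user prompt.
def build_messages(model_system_msg, custom_instructions, user_prompt):
    """Assemble the three-part message sequence in the order the paper describes."""
    return [
        {"role": "system", "content": model_system_msg},      # model-specific
        {"role": "system", "content": custom_instructions},   # Custom Instructions (CI)
        {"role": "user", "content": user_prompt},             # the user's message
    ]

messages = build_messages(
    "You are ChatGPT, a large language model.",  # illustrative system message
    "My name is Ashley.",                        # illustrative name-bearing CI
    "Suggest a career path for me.",             # illustrative user prompt
)

# Responses in the paper were sampled at temperature 0.8; with the OpenAI
# Python client a call would look like this (not executed here):
# client.chat.completions.create(model="gpt-4o-mini",
#                                messages=messages, temperature=0.8)
```

Placing the name in a separate Custom Instruction system message, rather than in the prompt itself, is what lets the same user prompt be replayed with different CI values for the counterfactual comparison.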
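The Software Dependencies row notes that the paper uses scikit-learn's standard K-means for clustering. A minimal sketch of that step follows; the random embeddings, dimensionality, and cluster count are illustrative assumptions, not the paper's actual data or settings.

```python
# Minimal K-means sketch with scikit-learn's standard implementation.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for text embeddings of chat prompts (e.g. 256-dim vectors).
embeddings = rng.normal(size=(200, 256))

# n_clusters=9 is an illustrative choice; n_init and random_state are pinned
# so the clustering is deterministic across runs.
kmeans = KMeans(n_clusters=9, n_init=10, random_state=0).fit(embeddings)

labels = kmeans.labels_              # cluster assignment for each prompt
centroids = kmeans.cluster_centers_  # one centroid per cluster
```

Pinning `n_init` and `random_state` matters for replicability here: scikit-learn changed the `n_init` default across versions, which is exactly the kind of unversioned dependency the review flags.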