Prompting Fairness: Integrating Causality to Debias Large Language Models

Authors: Jingling Li, Zeyu Tang, Xiaoyu Liu, Peter Spirtes, Kun Zhang, Liu Leqi, Yang Liu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our framework through extensive experiments on real-world datasets across multiple domains, demonstrating its effectiveness in debiasing LLM decisions, even with only black-box access to the model.
Researcher Affiliation | Collaboration | (1) Google DeepMind; (2) Department of Philosophy, Carnegie Mellon University; (3) Department of Computer Science, University of Maryland, College Park; (4) Machine Learning Department, Mohamed bin Zayed University of Artificial Intelligence; (5) University of Texas at Austin; (6) Computer Science and Engineering Department, University of California, Santa Cruz
Pseudocode | No | The paper contains no sections explicitly labeled 'Pseudocode' or 'Algorithm', nor any structured, code-like procedures. Figure 1(b) illustrates a systematic approach with a diagram, but it is not pseudocode.
Open Source Code | No | The paper states in Appendix C.1: 'We have provided the cleaned version [of WinoBias] in the supplementary materials.' However, this refers to a dataset, not the source code for the methodology described in the paper. There is no explicit statement about a code release or a link to a code repository.
Open Datasets | Yes | We conduct extensive experiments on three widely utilized benchmarks that evaluate language models' decision bias: WinoBias by Zhao et al. (2018), the Bias Benchmark for QA (BBQ) by Parrish et al. (2021), and Discrim-Eval by Tamkin et al. (2023).
Dataset Splits | Yes | For experiments on the WinoBias dataset, we combined both the training and test data for evaluation, as there is no need to separate them when using prompting-based debiasing techniques. ... We removed these 60 examples during our evaluation... For our experiments, we consider the disambiguated setting in BBQ, where we test whether the model's biases override a correct answer choice given an adequately informative context. ... There are over 16,000 examples under this setting...
Hardware Specification | No | The paper lists the large language models used for experiments (GPT-3, GPT-3.5, Claude 2, GPT-4, Mistral-7B) but does not provide any details about the hardware (e.g., GPU models, CPU types, memory) on which these models were run or evaluated.
Software Dependencies | Yes | For the GPT models used in our experiments on WinoBias and Discrim-Eval, we consider snapshots from June 13th, 2023, whose knowledge cut-off is September 2021. Since the legacy GPT-3 model (a.k.a. text-davinci-003) was no longer supported when we conducted the experiments, we use gpt-3.5-turbo-instruct instead, as it has capabilities similar to GPT-3-era models. The Mistral-7B model we use in our experiments is the improved instruction-fine-tuned version (a.k.a. Mistral-7B-Instruct-v0.2). For our experiments on the BBQ dataset, we use the latest GPT-4 version (i.e., gpt-4-turbo).
Experiment Setup | Yes | All LLM responses are obtained with a temperature of 0. ... In WinoBias, we use the same 16 ICL examples as in Si et al. (2022). ... We change the number of ICL examples to 8 to match the settings in Si et al. (2022). ... We apply 2-round iterative prompting in our experiments, where we let the models generate freely and then ask them to summarize their answers in one or two words.
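The 2-round iterative prompting quoted above (free-form generation, then a request to summarize in one or two words) can be sketched as below. This is a minimal illustration, not the authors' harness: `query_llm` is a hypothetical placeholder for a temperature-0 chat-completion call, stubbed here with canned replies so the control flow runs end to end.

```python
def query_llm(messages, temperature=0.0):
    """Hypothetical stand-in for a chat-completion API call.

    A real implementation would send `messages` to a model with the
    given temperature; this stub returns canned text so the two-round
    flow below is runnable.
    """
    last = messages[-1]["content"]
    if "one or two words" in last:
        return "the secretary"  # terse round-2 summary
    return ("Reasoning freely: the pronoun most plausibly refers to "
            "the secretary, based on the sentence context.")


def two_round_answer(question):
    """Round 1: let the model generate freely (temperature 0).
    Round 2: ask it to summarize its answer in one or two words."""
    messages = [{"role": "user", "content": question}]
    free_form = query_llm(messages, temperature=0.0)
    messages += [
        {"role": "assistant", "content": free_form},
        {"role": "user",
         "content": "Please summarize your answer in one or two words."},
    ]
    return query_llm(messages, temperature=0.0)


answer = two_round_answer("Who does 'she' refer to in the sentence?")
print(answer)  # -> "the secretary" with this stub
```

The second round exists only to extract a short, easily scored label from the free-form response; swapping the stub for a real API client (with temperature=0 for determinism) would reproduce the pattern described in the row above.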