Test-Time Fairness and Robustness in Large Language Models
Authors: Leonardo Cotta, Chris J. Maddison
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we use six benchmark datasets with nine different protected/spurious attributes to show that OOC prompting achieves state-of-the-art results on fairness and robustness across different model families and sizes, without sacrificing much predictive performance. |
| Researcher Affiliation | Collaboration | Leonardo Cotta (EMAIL), Vector Institute; Chris J. Maddison (EMAIL), University of Toronto and Vector Institute |
| Pseudocode | Yes | Algorithm 1 OOC prompting strategy. |
| Open Source Code | No | The paper does not provide explicit links to source code or statements about its release. The link provided in the header points to the paper's OpenReview page, not a code repository. |
| Open Datasets | Yes | Toxic Comments. We consider the dataset civilcomments as proposed in Koh et al. (2021). Bios. We take the dataset of biography passages originally proposed by De-Arteaga et al. (2019). Amazon. Here, we have the Amazon fashion reviews dataset (Ni et al., 2019). Discrimination. We also take the synthetic dataset of yes/no questions recently proposed by Tamkin et al. (2023). Clinical. Finally, we consider the MIMIC-III (Johnson et al., 2016) set of clinical notes (X). Both the context and the label information are extracted from the subset MIMIC-SBDH (Ahsan et al., 2021). |
| Dataset Splits | Yes | For each dataset and context pair, we estimate the SI-bias with 200 random examples balanced according to S and Z. To compute the predictive performance (macro F1-score) of each prompting strategy, we take 200 random examples sampled i.i.d. from the original dataset. |
| Hardware Specification | No | The paper mentions several LLM models used (e.g., gpt-3.5-turbo, gpt-4-turbo, LLAMA-3-70B, Claude 3.5 Sonnet, gpt-4o-mini) but does not provide any specific details about the hardware (GPUs, CPUs, etc.) on which these models were run for the experiments. |
| Software Dependencies | No | The paper mentions the use of various LLMs (e.g., gpt-3.5-turbo, gpt-4-turbo, LLAMA-3-70B, Claude 3.5 Sonnet), but it does not specify any software libraries, frameworks, or their version numbers that would be necessary to replicate the experiments. |
| Experiment Setup | Yes | As is common practice (Wei et al., 2022), we use temperature 0 to predict the labels of each task (including OOC). We evaluate stratified invariance in three popular, frontier LLMs: gpt-3.5-turbo, gpt-4-turbo (OpenAI, 2023), and LLAMA-3-70B (Dubey et al., 2024). As suggested in Sordoni et al. (2023), we generate our counterfactual transformations with a temperature of 0.7 (GPT family) and 0.8 for the other models. We used m = 3 samples for OOC with all models and tasks except for gpt-4-turbo and Clinical, where we used m = 1 due to their high monetary cost and larger input size, respectively. |
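The setup above implies a two-stage pipeline per input: sample m counterfactual transformations at a higher temperature, then label each with a deterministic (temperature-0) call. A minimal sketch is below, assuming majority-vote aggregation over the m labels; the paper's Algorithm 1 defines the exact OOC rule, and both LLM calls here are stood in by toy functions for illustration.

```python
from collections import Counter

def ooc_predict(x, generate_counterfactual, predict_label, m=3):
    """Sketch of an OOC-style pipeline: draw m counterfactual rewrites of the
    input (sampled at temperature 0.7-0.8 in the paper), label each rewrite
    with a temperature-0 call, and aggregate. Majority vote is our assumption,
    not necessarily the paper's exact aggregation."""
    rewrites = [generate_counterfactual(x) for _ in range(m)]
    labels = [predict_label(r) for r in rewrites]
    return Counter(labels).most_common(1)[0][0]

# Toy, deterministic stand-ins for the two LLM calls (hypothetical).
def toy_rewrite(x):
    # Pretend-counterfactual: neutralize gendered pronouns.
    return x.replace("she", "they").replace("he", "they")

def toy_classify(x):
    # Pretend temperature-0 classifier.
    return "toxic" if "stupid" in x else "ok"

print(ooc_predict("she said something stupid", toy_rewrite, toy_classify, m=3))
# -> toxic
```

Note that with m = 1 (as used for gpt-4-turbo and Clinical), the vote degenerates to a single counterfactual prediction.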