FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs

Authors: Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Zuozhu Liu

ICLR 2025

Reproducibility assessment. For each variable below, the result is followed by the supporting LLM response.

Research Type: Experimental
Our experiments and analysis on FairMT-10K reveal that in multi-turn dialogue scenarios, LLMs are more prone to generating biased responses, showing significant variation in performance across different tasks and models. Based on these findings, we develop a more challenging dataset, FairMT-1K, and test 15 current state-of-the-art (SOTA) LLMs on this dataset.

Researcher Affiliation: Academia
Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Zuozhu Liu (Zhejiang University)

Pseudocode: Yes
The templates and generation process for each task are shown in Appendix A.1. Figure 9: Design of multi-turn prompt templates for Anaphora Ellipsis tasks.

Open Source Code: Yes
Our code and dataset are available at FairMT-Bench. All code and models will be made publicly available to support reproducibility and facilitate further research.

Open Datasets: Yes
Our code and dataset are available at FairMT-Bench. Specifically, we use RedditBias (Barikeri et al., 2021) and the Social Bias Inference Corpus (SBIC) (Sap et al., 2019) as sources for stereotype data, and HateXplain (Mathew et al., 2021) as the source for toxicity data.

Dataset Splits: Yes
To enable more efficient evaluation, we distill the most challenging data from FairMT-Bench to create a lighter LLM fairness benchmark, FairMT-1K. Specifically, we select the data points on which the six models had the highest error ratio in the original FairMT-10K dataset, based on our testing results. An equal number of samples is chosen from each task. The selection method for FairMT-1K is detailed in Appendix B.5.

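The FairMT-1K distillation step described above (rank FairMT-10K samples by how many of the six evaluated models failed on them, then take an equal number of the hardest samples per task) can be sketched as follows. The field names (`task`, `errors`) and the helper function are illustrative assumptions, not the paper's actual code or schema.

```python
from collections import defaultdict

def select_hardest(samples, per_task, n_models=6):
    """Pick the per-task samples with the highest error ratio.

    `samples` is a list of dicts with (hypothetical) keys:
      - "task":   task name
      - "errors": number of the evaluated models that produced a
                  biased/unfair response on this sample
    """
    by_task = defaultdict(list)
    for s in samples:
        # Error ratio = fraction of the evaluated models that failed here.
        by_task[s["task"]].append((s["errors"] / n_models, s))

    selected = []
    for task, scored in by_task.items():
        # Hardest samples first, then take an equal count from each task.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        selected.extend(s for _, s in scored[:per_task])
    return selected
```

Taking an equal number from each task keeps the distilled benchmark balanced across the fairness tasks rather than dominated by whichever task has the highest overall failure rate.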
Hardware Specification: Yes
In our setup, using Llama-2-7b-chat as an example, testing on FairMT-10K takes about 72.5 H100 GPU hours and costs approximately 171.28 USD using the GPT-4 API for evaluation. Specific settings and cost calculations are detailed in Appendix A.5. Appendix A.5 adds: The generation is performed on a single NVIDIA H100 GPU.

Software Dependencies: No
The paper mentions external models/APIs (GPT-4, Llama-Guard-3-8B) with implied versions, and names the evaluated models (Llama-2-chat-hf (7B, 13B), Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3, Gemma-7b-it, ChatGPT-3.5), but it does not list ancillary software components such as programming languages or libraries (e.g., PyTorch, TensorFlow), nor specific versions used in its own implementation.

Experiment Setup: Yes
Settings: Based on the dataset construction process outlined in the previous chapter, we generated multi-turn dialogue datasets for each task, consisting of 5 turns of prompts. During the fairness evaluation, we used the prompts and responses from the earlier turns as the dialogue history in all experiments. For each LLM, we applied the corresponding chat format and system prompt, setting the temperature to 0.7 and k to 1, while limiting max new tokens to 150. For the LLM judge (GPT-4), we set the temperature to 0.6.
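The multi-turn loop described above (5 turns per sample, with earlier prompts and responses carried forward as dialogue history) can be sketched as below. The function and config names are illustrative assumptions; `generate_fn` stands in for the actual model call (e.g. a wrapper around a chat-formatted LLM), and the `"k"` entry simply mirrors the reported "k to 1" setting without guessing which sampling parameter it denotes.

```python
# Reported generation settings; key names are illustrative.
GEN_CONFIG = {
    "temperature": 0.7,
    "k": 1,               # the paper reports setting "k to 1"
    "max_new_tokens": 150,
}

def run_dialogue(system_prompt, turn_prompts, generate_fn):
    """Run one multi-turn sample, accumulating dialogue history.

    Each turn sees the system prompt plus all earlier user prompts and
    model responses, matching the benchmark's multi-turn evaluation.
    """
    history = [{"role": "system", "content": system_prompt}]
    responses = []
    for prompt in turn_prompts:  # the benchmark uses 5 turns per sample
        history.append({"role": "user", "content": prompt})
        reply = generate_fn(history, **GEN_CONFIG)
        history.append({"role": "assistant", "content": reply})
        responses.append(reply)
    return responses, history
```

Because the full history is replayed each turn, a biased response in an early turn remains in context for all later turns, which is exactly the accumulation effect the benchmark probes.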