FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs

Authors: Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Zuozhu Liu

ICLR 2025

Reproducibility assessment. For each variable below, the result is followed by the supporting LLM response.

Research Type: Experimental
Our experiments and analysis on FairMT-10K reveal that in multi-turn dialogue scenarios, LLMs are more prone to generating biased responses, showing significant variation in performance across different tasks and models. Based on these findings, we develop a more challenging dataset, FairMT-1K, and test 15 current state-of-the-art (SOTA) LLMs on this dataset.

Researcher Affiliation: Academia
Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Zuozhu Liu (Zhejiang University)

Pseudocode: Yes
The templates and generation process for each task are shown in Appendix A.1. Figure 9: Design of multi-turn prompt templates for Anaphora Ellipsis tasks.

Open Source Code: Yes
Our code and dataset are available at FairMT-Bench. All code and models will be made publicly available to support reproducibility and facilitate further research.

Open Datasets: Yes
Our code and dataset are available at FairMT-Bench. Specifically, we use RedditBias (Barikeri et al., 2021) and the Social Bias Inference Corpus (SBIC) (Sap et al., 2019) as sources for stereotype data, and HateXplain (Mathew et al., 2021) as the source for toxicity data.

Dataset Splits: Yes
To enable more efficient evaluation, we distill the most challenging data from FairMT-Bench to create a lighter LLM fairness benchmark, FairMT-1K. Specifically, we select the data points on which the six models had the highest error ratio in the original FairMT-10K dataset, based on our testing results. An equal number of samples is chosen from each task. The selection method for FairMT-1K is detailed in Appendix B.5.

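The FairMT-1K distillation step described above (rank FairMT-10K samples by how many of the six evaluated models failed on them, then take an equal number of the hardest samples per task) can be sketched as follows. The field names (`task`, `errors`) and the helper function are illustrative assumptions, not the paper's actual code or schema.

```python
from collections import defaultdict

def select_hardest(samples, per_task, n_models=6):
    """Pick the per-task samples with the highest error ratio.

    `samples` is a list of dicts with (hypothetical) keys:
      - "task":   task name
      - "errors": number of the evaluated models that produced a
                  biased/unfair response on this sample
    """
    by_task = defaultdict(list)
    for s in samples:
        # Error ratio = fraction of the evaluated models that failed here.
        by_task[s["task"]].append((s["errors"] / n_models, s))

    selected = []
    for task, scored in by_task.items():
        # Hardest samples first, then take an equal count from each task.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        selected.extend(s for _, s in scored[:per_task])
    return selected
```

Taking an equal number from each task keeps the distilled benchmark balanced across the fairness tasks rather than dominated by whichever task has the highest overall failure rate.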
Hardware Specification: Yes
In our setup, using Llama-2-7b-chat as an example, testing on FairMT-10K takes about 72.5 H100 GPU hours and costs approximately 171.28 USD using the GPT-4 API for evaluation. Specific settings and cost calculations are detailed in Appendix A.5. Appendix A.5 adds: The generation is performed on a single NVIDIA H100 GPU.

Software Dependencies: No
The paper mentions external models/APIs (GPT-4, Llama-Guard-3-8B) with implied versions, and names the evaluated models (Llama-2-chat-hf (7B, 13B), Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3, Gemma-7b-it, ChatGPT-3.5), but it does not list ancillary software components such as programming languages or libraries (e.g., PyTorch, TensorFlow), nor specific versions used in its own implementation.

Experiment Setup: Yes
Settings: Based on the dataset construction process outlined in the previous chapter, we generated multi-turn dialogue datasets for each task, consisting of 5 turns of prompts. During the fairness evaluation, we used the prompts and responses from the earlier turns as the dialogue history in all experiments. For each LLM, we applied the corresponding chat format and system prompt, setting the temperature to 0.7 and k to 1, while limiting max new tokens to 150. For the LLM judge (GPT-4), we set the temperature to 0.6.
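The multi-turn loop described above (5 turns per sample, with earlier prompts and responses carried forward as dialogue history) can be sketched as below. The function and config names are illustrative assumptions; `generate_fn` stands in for the actual model call (e.g. a wrapper around a chat-formatted LLM), and the `"k"` entry simply mirrors the reported "k to 1" setting without guessing which sampling parameter it denotes.

```python
# Reported generation settings; key names are illustrative.
GEN_CONFIG = {
    "temperature": 0.7,
    "k": 1,               # the paper reports setting "k to 1"
    "max_new_tokens": 150,
}

def run_dialogue(system_prompt, turn_prompts, generate_fn):
    """Run one multi-turn sample, accumulating dialogue history.

    Each turn sees the system prompt plus all earlier user prompts and
    model responses, matching the benchmark's multi-turn evaluation.
    """
    history = [{"role": "system", "content": system_prompt}]
    responses = []
    for prompt in turn_prompts:  # the benchmark uses 5 turns per sample
        history.append({"role": "user", "content": prompt})
        reply = generate_fn(history, **GEN_CONFIG)
        history.append({"role": "assistant", "content": reply})
        responses.append(reply)
    return responses, history
```

Because the full history is replayed each turn, a biased response in an early turn remains in context for all later turns, which is exactly the accumulation effect the benchmark probes.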