Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Authors: Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh Chawla, Xiangliang Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments cover multiple popular language models, and the results indicate that while advanced models have achieved commendable overall performance, significant biases persist in certain specific tasks. An extensive evaluation of six popular LLMs using the CALM framework, as shown in Figure 1, reveals that while some LLMs demonstrate notable fairness in judgment, there remains significant room for improvement in achieving more robust decision-making across various types of bias. |
| Researcher Affiliation | Collaboration | University of Notre Dame; MBZUAI; University of Washington; Peking University; IBM Research; University of Hong Kong |
| Pseudocode | No | The paper describes methods and processes like the automated perturbation mechanism g(), but it does not present any formal pseudocode or algorithm blocks describing structured steps for a procedure. Instead, it provides prompt templates for LLM interactions in Appendix G, which are not considered pseudocode. |
| Open Source Code | Yes | To ensure reproducibility, the supplementary materials accompanying this paper include our complete experimental code, datasets, and evaluation scripts. These materials cover core components such as data generation, prompt templates, and API handlers, as well as specific code and result logs for different bias types. |
| Open Datasets | Yes | We prepared three datasets in CALM for supporting bias assessment in various judging tasks: fact-related, refinement-aware evaluation, and alignment datasets. The details of these datasets are shown in Table 3. Table 3 lists sources such as Truthy-DPO-v0.1 (Durbin, 2023), Orca-DPO-Pairs (Intel, 2023), GSM8K (Cobbe et al., 2021), and TruthfulQA (Lin et al., 2022), with explicit citations. |
| Dataset Splits | Yes | We prepared three datasets in CALM for supporting bias assessment in various judging tasks: fact-related, refinement-aware evaluation, and alignment datasets. The details of these datasets are shown in Table 3. Table 3 specifies the number of samples for each dataset, such as 439 for Alignment, 500 for Fact-related, and 500 for Refinement. The metrics section states 'calculating over all samples in test dataset D', indicating that these full datasets serve as the test sets for their experiments. |
| Hardware Specification | No | The paper discusses the large language models evaluated (e.g., ChatGPT, GPT-4-Turbo, Claude-3.5) and generative models (e.g., Mixtral-8x22b, Llama3-70b) but does not provide any specific details about the hardware (e.g., GPU models, CPU types) used to conduct the experiments. |
| Software Dependencies | Yes | The selected models are: ChatGPT (OpenAI, 2024b), GPT-4-Turbo (OpenAI, 2024a), GPT-4o (OpenAI, 2024c), Claude-3.5 (Anthropic, 2024), GLM-4 (GLM et al., 2024), and the open-source Qwen2-72B-Instruct (Bai et al., 2023), which are further detailed in Table 11. Table 11 explicitly lists specific model versions like 'gpt-3.5-turbo-0125' and 'gpt-4-turbo-0409'. |
| Experiment Setup | Yes | We followed the experimental setup of Chen et al. (2024b) by setting the temperature to 0.7 and applied it to all judge models and generating models to ensure stable output quality and strong reproducibility. |
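The uniform temperature setting described in the Experiment Setup row can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the helper `build_judge_request`, the prompts, and the payload shape are assumptions modeled on a typical chat-completion API, while the temperature value 0.7 and the model version string come from the paper.

```python
# Sketch of the reported evaluation setup: a single temperature of 0.7 is
# applied to every judge and generator call (following Chen et al., 2024b).
JUDGE_TEMPERATURE = 0.7  # value stated in the paper; applied uniformly


def build_judge_request(model: str, system_prompt: str, sample: str) -> dict:
    """Assemble one chat-completion payload for a judging call.

    Hypothetical helper: the payload layout mirrors common chat APIs and is
    not taken from the paper's supplementary code.
    """
    return {
        "model": model,
        "temperature": JUDGE_TEMPERATURE,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": sample},
        ],
    }


# Example: a pairwise-comparison judging call with a model version from Table 11.
request = build_judge_request(
    "gpt-4-turbo-0409",
    "You are a fair judge. Compare the two answers and choose the better one.",
    "Question: ...\nAnswer A: ...\nAnswer B: ...",
)
print(request["temperature"])  # 0.7
```

Keeping the temperature in one constant makes it trivial to confirm that judge and generator calls share the same sampling configuration, which is what the reproducibility claim rests on.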