Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Authors: Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh Chawla, Xiangliang Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments cover multiple popular language models, and the results indicate that while advanced models have achieved commendable overall performance, significant biases persist in certain specific tasks. An extensive evaluation of six popular LLMs using the CALM framework, as shown in Figure 1, reveals that while some LLMs demonstrate notable fairness in judgment, there remains significant room for improvement in achieving more robust decision-making across various types of bias. |
| Researcher Affiliation | Collaboration | University of Notre Dame; MBZUAI; University of Washington; Peking University; IBM Research; University of Hong Kong |
| Pseudocode | No | The paper describes methods and processes like the automated perturbation mechanism g(), but it does not present any formal pseudocode or algorithm blocks describing structured steps for a procedure. Instead, it provides prompt templates for LLM interactions in Appendix G, which are not considered pseudocode. |
| Open Source Code | Yes | To ensure reproducibility, the supplementary materials accompanying this paper include our complete experimental code, datasets, and evaluation scripts. These materials cover core components such as data generation, prompt templates, and API handlers, as well as specific code and result logs for different bias types. |
| Open Datasets | Yes | We prepared three datasets in CALM for supporting bias assessment in various judging tasks: fact-related, refinement-aware evaluation, and alignment datasets. The details of these datasets are shown in Table 3. Table 3 lists sources such as Truthy-DPO-v0.1 (Durbin, 2023), Orca-DPO-Pairs (Intel, 2023), GSM8K (Cobbe et al., 2021), and TruthfulQA (Lin et al., 2022), with explicit citations. |
| Dataset Splits | Yes | We prepared three datasets in CALM for supporting bias assessment in various judging tasks: fact-related, refinement-aware evaluation, and alignment datasets. The details of these datasets are shown in Table 3. Table 3 specifies the number of samples for each dataset, such as 439 for Alignment, 500 for Fact-related, and 500 for Refinement. The metrics section states 'calculating over all samples in test dataset D', indicating that these full datasets serve as the test sets for their experiments. |
| Hardware Specification | No | The paper discusses the large language models evaluated (e.g., ChatGPT, GPT-4-Turbo, Claude-3.5) and generative models (e.g., Mixtral-8x22b, Llama3-70b) but does not provide any specific details about the hardware (e.g., GPU models, CPU types) used to conduct the experiments. |
| Software Dependencies | Yes | The selected models are: ChatGPT (OpenAI, 2024b), GPT-4-Turbo (OpenAI, 2024a), GPT-4o (OpenAI, 2024c), Claude-3.5 (Anthropic, 2024), GLM-4 (GLM et al., 2024), and the open-source Qwen2-72B-Instruct (Bai et al., 2023), which are further detailed in Table 11. Table 11 explicitly lists specific model versions like 'gpt-3.5-turbo-0125' and 'gpt-4-turbo-0409'. |
| Experiment Setup | Yes | We followed the experimental setup of Chen et al. (2024b) by setting the temperature to 0.7 and applied it to all judge models and generating models to ensure stable output quality and strong reproducibility. |
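The uniform temperature setting described in the Experiment Setup row can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the helper `build_judge_request`, the prompts, and the payload shape are assumptions modeled on a typical chat-completion API, while the temperature value 0.7 and the model version string come from the paper.

```python
# Sketch of the reported evaluation setup: a single temperature of 0.7 is
# applied to every judge and generator call (following Chen et al., 2024b).
JUDGE_TEMPERATURE = 0.7  # value stated in the paper; applied uniformly


def build_judge_request(model: str, system_prompt: str, sample: str) -> dict:
    """Assemble one chat-completion payload for a judging call.

    Hypothetical helper: the payload layout mirrors common chat APIs and is
    not taken from the paper's supplementary code.
    """
    return {
        "model": model,
        "temperature": JUDGE_TEMPERATURE,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": sample},
        ],
    }


# Example: a pairwise-comparison judging call with a model version from Table 11.
request = build_judge_request(
    "gpt-4-turbo-0409",
    "You are a fair judge. Compare the two answers and choose the better one.",
    "Question: ...\nAnswer A: ...\nAnswer B: ...",
)
print(request["temperature"])  # 0.7
```

Keeping the temperature in one constant makes it trivial to confirm that judge and generator calls share the same sampling configuration, which is what the reproducibility claim rests on.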