Evaluating the Evaluator: Measuring LLMs’ Adherence to Task Evaluation Instructions

Authors: Bhuvanashree Murugadoss, Christian Poelitz, Ian Drosos, Vu Le, Nick McKenna, Carina Suzana Negreanu, Chris Parnin, Advait Sarkar

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "To investigate how much influence prompting the LLMs-as-a-judge has on the alignment of AI judgements to human judgements, we analyze prompts with increasing levels of instructions about the target quality of an evaluation, for several LLMs-as-a-judge. Further, we compare to a prompt-free method using model perplexity as a quality measure instead. We aggregate a taxonomy of quality criteria commonly used across state-of-the-art evaluations with LLMs and provide this as a rigorous benchmark of models as judges. Overall, we show that the LLMs-as-a-judge benefit only little from highly detailed instructions in prompts and that perplexity can sometimes align better with human judgements than prompting, especially on textual quality."
Researcher Affiliation: Collaboration. Bhuvanashree Murugadoss (Microsoft), Christian Poelitz (Microsoft Research Cambridge), Ian Drosos (Microsoft Research Cambridge), Vu Le (Microsoft), Nick McKenna (Microsoft Research Cambridge), Carina Suzana Negreanu (Microsoft), Chris Parnin (Microsoft; North Carolina State University), Advait Sarkar (Microsoft).
Pseudocode: No. The paper describes its methods in narrative text and provides example prompts in Figure 1, but it does not include structured pseudocode or algorithm blocks.
Open Source Code: No. The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets: Yes. "We use 8 different open-source benchmark datasets commonly used for LLM-based evaluations with human annotations for several evaluation criteria per task. The datasets cover tasks spanning several aspects, from coarse-grained NLG-quality evaluations to fine-grained, very task-specific evaluations with detailed information about how to score the example solutions. Firstly, we leverage two of the most prominently used datasets for coarse-grained NLG-quality evaluations: the SummEval (Fabbri et al. 2021) dataset... and the Topical-Chat (Gopalakrishnan et al. 2019) dataset..."
Dataset Splits: No. The paper refers to using "human annotations given for each quality criteria from the benchmark datasets" for evaluation, but it does not specify training/validation/test splits for its experiments, nor does it reference predefined splits from those benchmarks.
Hardware Specification: No. The paper discusses the LLMs used for evaluation (e.g., GPT4-Turbo, Llama3), but it provides no details about the hardware (GPU, CPU, memory, etc.) on which the evaluations or analysis were run.
Software Dependencies: No. The paper does not provide version numbers for any software dependencies (e.g., programming languages, libraries, or frameworks) used to conduct the experiments or analysis.
Experiment Setup: No. The paper describes the prompting settings used to evaluate LLMs-as-a-judge, but it does not provide experimental setup details such as hyperparameters, optimizer settings, or training configurations; the work evaluates pre-trained models rather than training new ones.
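The prompt-free baseline noted in the Research Type response scores a text by model perplexity rather than by a judge prompt. A minimal sketch of that quantity is below; the helper name `perplexity` and the toy token log-probabilities are illustrative assumptions, not taken from the paper:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-probability per token.

    Lower perplexity means the language model finds the text more
    predictable, which serves as a prompt-free quality signal.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Toy example: natural-log probabilities for a 4-token candidate text.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
print(round(perplexity(logprobs), 3))  # → 2.828, i.e. 64 ** 0.25
```

In practice the per-token log-probabilities would come from scoring the candidate text with a causal language model; perplexity is then the inverse geometric mean of those token probabilities.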