Evaluating the Evaluator: Measuring LLMs’ Adherence to Task Evaluation Instructions
Authors: Bhuvanashree Murugadoss, Christian Poelitz, Ian Drosos, Vu Le, Nick McKenna, Carina Suzana Negreanu, Chris Parnin, Advait Sarkar
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To investigate how much influence prompting the LLMs-as-a-judge has on the alignment of AI judgements to human judgements, we analyze prompts with increasing levels of instructions about the target quality of an evaluation, for several LLMs-as-a-judge. Further, we compare to a prompt-free method using model perplexity as a quality measure instead. We aggregate a taxonomy of quality criteria commonly used across state-of-the-art evaluations with LLMs and provide this as a rigorous benchmark of models as judges. Overall, we show that the LLMs-as-a-judge benefit only little from highly detailed instructions in prompts and that perplexity can sometimes align better with human judgements than prompting, especially on textual quality. |
| Researcher Affiliation | Collaboration | Bhuvanashree Murugadoss1, Christian Poelitz2, Ian Drosos2, Vu Le1, Nick McKenna2, Carina Suzana Negreanu1, Chris Parnin1,3, Advait Sarkar1 1Microsoft 2Microsoft Research Cambridge 3North Carolina State University |
| Pseudocode | No | The paper describes the methods in narrative text and provides examples of prompts in Figure 1, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We use 8 different open-source benchmark datasets commonly used for LLM-based evaluations with human annotations for several evaluation criteria per task. The datasets cover tasks which span several aspects from coarse-grained NLG-quality evaluations, to fine-grained very task specific evaluations with detailed information about how to score the example solutions. Firstly, we leverage two of the most prominently used datasets for coarse-grained NLG-quality evaluations: The SummEval (Fabbri et al. 2021) dataset... and the Topical-Chat (Gopalakrishnan et al. 2019) dataset... |
| Dataset Splits | No | The paper refers to using "human annotations given for each quality criteria from the benchmark datasets" for evaluation, but it does not specify any training/test/validation splits used for its experiments or reference predefined splits from the benchmarks for its evaluation process. |
| Hardware Specification | No | The paper discusses the LLMs used for evaluation (e.g., GPT4-Turbo, Llama3), but it does not provide any specific details about the hardware (GPU, CPU, memory, etc.) on which these evaluations were performed or the analysis was conducted. |
| Software Dependencies | No | The paper does not provide specific version numbers for any ancillary software dependencies (e.g., programming languages, libraries, or frameworks) used to conduct the experiments or analysis. |
| Experiment Setup | No | The paper describes different prompting settings used to evaluate LLMs-as-a-judge, but it does not provide specific experimental setup details such as hyperparameters, optimizer settings, or training configurations, as the work involves evaluating pre-trained models rather than training new ones. |
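The prompt-free baseline described in the Research Type row scores a candidate text by model perplexity rather than by prompting a judge. As a rough illustration of that idea (not the paper's implementation), perplexity can be computed from per-token log-probabilities returned by a language model; the log-probability values below are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    PPL = exp(-(1/N) * sum(log p_i)).

    Lower perplexity means the model finds the text more predictable,
    which the paper treats as a proxy for textual quality.
    """
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# Hypothetical log-probabilities for a 4-token candidate answer.
logprobs = [-0.10, -0.25, -0.05, -0.40]
print(round(perplexity(logprobs), 4))  # → 1.2214
```

In a real evaluation these log-probabilities would come from the judged model itself (e.g. via an API that exposes token log-probs), so no evaluation prompt is needed at all.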