Benchmarking LLMs' Judgments with No Gold Standard

Authors: Shengwei Xu, Yuxuan Lu, Grant Schoenebeck, Yuqing Kong

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In experiments on a human-annotated dataset, GEM demonstrates competitive correlations with human scores compared to the state-of-the-art GPT-4o Examiner, and outperforms all other baselines. Additionally, GEM is more robust against strategic manipulations, such as rephrasing or elongation, which can artificially inflate scores under a GPT-4o Examiner. We also present GRE-bench (Generating Review Evaluation Benchmark) which evaluates LLMs based on how well they can generate high-quality peer reviews for academic research papers.
Researcher Affiliation Academia Shengwei Xu, School of Information, University of Michigan, Ann Arbor, USA, EMAIL; Yuxuan Lu, School of Computer Science, Peking University, Beijing, China, EMAIL; Grant Schoenebeck, School of Information, University of Michigan, Ann Arbor, USA, EMAIL; Yuqing Kong, School of Computer Science, Peking University, Beijing, China, EMAIL
Pseudocode Yes ALGORITHM 1: Validation Workflow
Input: A dataset D with n tuples of tasks and associated text responses; an evaluation metric f computing the scores; a degradation/manipulation strategy M.
Output: Summary statistics.
for i = 1 to n do
    Get the i-th tuple from the dataset: task w_i, candidate response x_i, and reference response y_i;
    Compute s_i := f(w_i, x_i, y_i);
    Replace the response x_i with x'_i according to the degradation/manipulation strategy M;
    Compute s'_i := f(w_i, x'_i, y_i);
end
Compute the means mu, mu' and standard deviations sigma, sigma' of {s_i}_{i in [n]} and {s'_i}_{i in [n]}, respectively.
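Algorithm 1 above can be sketched in a few lines of Python. This is a minimal illustration of the workflow, not the authors' implementation: the metric `f` and manipulation `M` are placeholders supplied by the caller, and the summary statistics use the population standard deviation (the paper does not specify which variant is used).

```python
import statistics
from typing import Callable, List, Tuple

def validation_workflow(
    dataset: List[Tuple[str, str, str]],       # tuples (task w_i, candidate x_i, reference y_i)
    f: Callable[[str, str, str], float],       # evaluation metric computing a score
    manipulate: Callable[[str], str],          # degradation/manipulation strategy M
) -> Tuple[float, float, float, float]:
    """Score every response before and after manipulation, then summarize."""
    original, manipulated = [], []
    for w, x, y in dataset:
        original.append(f(w, x, y))                 # s_i  := f(w_i, x_i, y_i)
        manipulated.append(f(w, manipulate(x), y))  # s'_i := f(w_i, x'_i, y_i)
    mu, mu_prime = statistics.mean(original), statistics.mean(manipulated)
    sigma, sigma_prime = statistics.pstdev(original), statistics.pstdev(manipulated)
    return mu, sigma, mu_prime, sigma_prime
```

A robust metric should show little change between (mu, sigma) and (mu', sigma') under manipulations such as rephrasing or elongation.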
Open Source Code Yes Ready-to-Use Code. We have made all the code for data collection, model fine-tuning, metric computation, and experiments available on GitHub. The repository is at https://github.com/yx-lu/Benchmarking-LLMs--Judgments-with-No-Gold-Standard.
Open Datasets Yes ICLR Dataset. We use the ICLR 2023 peer review data publicly available on OpenReview. We randomly select 300 papers as our benchmark dataset, and for each paper, we randomly select 3 original human reviews: one review to serve as a human candidate, and the other two as reference responses.
Dataset Splits Yes ICLR Dataset. We use the ICLR 2023 peer review data publicly available on OpenReview. We randomly select 300 papers as our benchmark dataset, and for each paper, we randomly select 3 original human reviews: one review to serve as a human candidate, and the other two as reference responses.
Hardware Specification Yes Experiments with Llama-3.1-8B as the evaluation LM are conducted on Google Colab instances with an NVIDIA A100 GPU. Experiments in Section 5 employ Llama-3.1-70B as the evaluation LM, which requires more GPU VRAM; thus, we use rental instances with a configuration of two NVIDIA H100 GPUs.
Software Dependencies Yes In the validation experiment in this section, Llama-3.1-8B is used as the evaluation LM to estimate log Pr_LLM[Y = y | X = x] and log Pr_LLM[Y = y]. After confirming the effectiveness of GEM metrics using the small model, for even better performance, we scaled up to a larger, 70B-parameter version of Llama-3.1 for computing the GRE-bench on the ICLR 2023 dataset in Section 5. We show the correlation between results of the smaller (8B) and larger (70B) models in Appendix A.2, to ensure robustness of our approach. For text preprocessing, we employ GPT-4o.
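The two quantities the evaluation LM estimates are sequence log-probabilities of the reference response y, with and without conditioning on the candidate x. A minimal sketch of how these estimates combine, assuming a pointwise-mutual-information-style score log Pr[Y = y | X = x] - log Pr[Y = y] (the exact GEM aggregation may differ; `sequence_logprob` stands in for summing per-token log-probs returned by a causal LM):

```python
from typing import Sequence

def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """Log-probability of a sequence is the sum of its per-token
    log-probs, as a causal LM would score the reference y token by token."""
    return sum(token_logprobs)

def pmi_score(cond_token_logprobs: Sequence[float],
              marg_token_logprobs: Sequence[float]) -> float:
    """Hypothetical GEM-style quantity: how much conditioning on the
    candidate x raises the LM's log-probability of the reference y."""
    return sequence_logprob(cond_token_logprobs) - sequence_logprob(marg_token_logprobs)
```

A positive score means the candidate response makes the reference response more predictable to the evaluation LM.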
Experiment Setup Yes In all LLM API calls in our experiments, for reproducibility, we keep the temperature at 0 and the maximum (output) token count at 4000, unless otherwise specified. ... Here are the settings we use for fine-tuning:
"model_config": {
    "base_model": "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",  # The base model
    "max_seq_length": 4096,  # The maximum sequence length
    "load_in_4bit": True,  # Load the model in 4-bit
},
"lora_config": {
    "r": 16,  # The LoRA rank (e.g. 8, 16, 32, 64)
    "lora_alpha": 16,  # The alpha value for LoRA
    "lora_dropout": 0,  # The dropout value for LoRA
},
"training_config": {
    "per_device_train_batch_size": 2,  # The batch size
    "gradient_accumulation_steps": 4,  # The gradient accumulation steps
    "warmup_steps": 5,  # The warmup steps
    "max_steps": 0,  # The maximum steps
    "num_train_epochs": 3,  # The number of training epochs
    "learning_rate": 2e-4,  # The learning rate
    "optim": "adamw_8bit",  # The optimizer
    "weight_decay": 0.01,  # The weight decay
    "lr_scheduler_type": "linear",  # The learning rate scheduler
    "seed": 42,  # The seed
}
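One detail worth making explicit from the training settings: gradient accumulation multiplies the per-device batch before each optimizer step. A small sketch of the arithmetic (the `effective_batch_size` helper and the single-GPU default are illustrative assumptions, not part of the paper's code):

```python
# Subset of the paper's reported training_config.
config = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
}

def effective_batch_size(cfg: dict, num_gpus: int = 1) -> int:
    """Examples seen per optimizer step: per-device batch size
    times accumulation steps times the number of devices."""
    return (cfg["per_device_train_batch_size"]
            * cfg["gradient_accumulation_steps"]
            * num_gpus)
```

With the reported settings on a single A100, each optimizer step therefore consumes 8 training examples.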