Benchmarking LLMs' Judgments with No Gold Standard

Authors: Shengwei Xu, Yuxuan Lu, Grant Schoenebeck, Yuqing Kong

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In experiments on a human-annotated dataset, GEM demonstrates competitive correlations with human scores compared to the state-of-the-art GPT-4o Examiner, and outperforms all other baselines. Additionally, GEM is more robust against strategic manipulations, such as rephrasing or elongation, which can artificially inflate scores under a GPT-4o Examiner. We also present GRE-bench (Generating Review Evaluation Benchmark) which evaluates LLMs based on how well they can generate high-quality peer reviews for academic research papers.
Researcher Affiliation Academia Shengwei Xu, School of Information, University of Michigan, Ann Arbor, USA, EMAIL; Yuxuan Lu, School of Computer Science, Peking University, Beijing, China, EMAIL; Grant Schoenebeck, School of Information, University of Michigan, Ann Arbor, USA, EMAIL; Yuqing Kong, School of Computer Science, Peking University, Beijing, China, EMAIL
Pseudocode Yes ALGORITHM 1: Validation Workflow
Input: A dataset D with n tuples of tasks and associated text responses; an evaluation metric f computing the scores; a degradation/manipulation strategy M.
Output: Summary statistics.
for i = 1 to n do
    Get the i-th tuple from the dataset: task w_i, candidate response x_i, and reference response y_i;
    Compute s_i := f(w_i, x_i, y_i);
    Replace the response x_i with x'_i according to the degradation/manipulation strategy M;
    Compute s'_i := f(w_i, x'_i, y_i);
end
Compute the means mu, mu' and standard deviations sigma, sigma' of {s_i}_{i in [n]} and {s'_i}_{i in [n]}, respectively.
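Algorithm 1 above can be sketched in a few lines of Python. This is a minimal illustration of the workflow, not the authors' implementation: the metric `f` and manipulation `M` are placeholders supplied by the caller, and the summary statistics use the population standard deviation (the paper does not specify which variant is used).

```python
import statistics
from typing import Callable, List, Tuple

def validation_workflow(
    dataset: List[Tuple[str, str, str]],       # tuples (task w_i, candidate x_i, reference y_i)
    f: Callable[[str, str, str], float],       # evaluation metric computing a score
    manipulate: Callable[[str], str],          # degradation/manipulation strategy M
) -> Tuple[float, float, float, float]:
    """Score every response before and after manipulation, then summarize."""
    original, manipulated = [], []
    for w, x, y in dataset:
        original.append(f(w, x, y))                 # s_i  := f(w_i, x_i, y_i)
        manipulated.append(f(w, manipulate(x), y))  # s'_i := f(w_i, x'_i, y_i)
    mu, mu_prime = statistics.mean(original), statistics.mean(manipulated)
    sigma, sigma_prime = statistics.pstdev(original), statistics.pstdev(manipulated)
    return mu, sigma, mu_prime, sigma_prime
```

A robust metric should show little change between (mu, sigma) and (mu', sigma') under manipulations such as rephrasing or elongation.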
Open Source Code Yes Ready-to-Use Code. We have made all the code for data collection, model fine-tuning, metric computation, and experiments available on GitHub. The repository is at https://github.com/yx-lu/Benchmarking-LLMs--Judgments-with-No-Gold-Standard.
Open Datasets Yes ICLR Dataset. We use the ICLR 2023 peer review data publicly available on OpenReview. We randomly select 300 papers as our benchmark dataset, and for each paper, we randomly select 3 original human reviews: one review to serve as a human candidate, and the other two as reference responses.
Dataset Splits Yes ICLR Dataset. We use the ICLR 2023 peer review data publicly available on OpenReview. We randomly select 300 papers as our benchmark dataset, and for each paper, we randomly select 3 original human reviews: one review to serve as a human candidate, and the other two as reference responses.
Hardware Specification Yes Experiments with Llama-3.1-8B as the evaluation LM are conducted on Google Colab instances with an NVIDIA A100 GPU. Experiments in Section 5 employ Llama-3.1-70B as the evaluation LM, which requires more GPU VRAM; thus, we use rental instances with a configuration of two NVIDIA H100 GPUs.
Software Dependencies Yes In the validation experiment in this section, Llama-3.1-8B is used as the evaluation LM to estimate log Pr_LLM[Y = y | X = x] and log Pr_LLM[Y = y]. After confirming the effectiveness of GEM metrics using the small model, for even better performance, we scaled up to a larger, 70B-parameter version of Llama-3.1 for computing the GRE-bench on the ICLR 2023 dataset in Section 5. We show the correlation between results of the smaller (8B) and larger (70B) models in Appendix A.2, to ensure robustness of our approach. For text preprocessing, we employ GPT-4o.
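The two quantities the evaluation LM estimates are sequence log-probabilities of the reference response y, with and without conditioning on the candidate x. A minimal sketch of how these estimates combine, assuming a pointwise-mutual-information-style score log Pr[Y = y | X = x] - log Pr[Y = y] (the exact GEM aggregation may differ; `sequence_logprob` stands in for summing per-token log-probs returned by a causal LM):

```python
from typing import Sequence

def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """Log-probability of a sequence is the sum of its per-token
    log-probs, as a causal LM would score the reference y token by token."""
    return sum(token_logprobs)

def pmi_score(cond_token_logprobs: Sequence[float],
              marg_token_logprobs: Sequence[float]) -> float:
    """Hypothetical GEM-style quantity: how much conditioning on the
    candidate x raises the LM's log-probability of the reference y."""
    return sequence_logprob(cond_token_logprobs) - sequence_logprob(marg_token_logprobs)
```

A positive score means the candidate response makes the reference response more predictable to the evaluation LM.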
Experiment Setup Yes In all LLM API calls in our experiments, for reproducibility, we keep the temperature at 0 and the maximum (output) token count at 4000, unless otherwise specified. ... Here are the settings we use for fine-tuning:
"model_config": {
    "base_model": "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",  # The base model
    "max_seq_length": 4096,  # The maximum sequence length
    "load_in_4bit": True,  # Load the model in 4-bit
},
"lora_config": {
    "r": 16,  # The LoRA rank (e.g. 8, 16, 32, 64)
    "lora_alpha": 16,  # The alpha value for LoRA
    "lora_dropout": 0,  # The dropout value for LoRA
},
"training_config": {
    "per_device_train_batch_size": 2,  # The batch size
    "gradient_accumulation_steps": 4,  # The gradient accumulation steps
    "warmup_steps": 5,  # The warmup steps
    "max_steps": 0,  # The maximum steps
    "num_train_epochs": 3,  # The number of training epochs
    "learning_rate": 2e-4,  # The learning rate
    "optim": "adamw_8bit",  # The optimizer
    "weight_decay": 0.01,  # The weight decay
    "lr_scheduler_type": "linear",  # The learning rate scheduler
    "seed": 42,  # The seed
}
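One detail worth making explicit from the training settings: gradient accumulation multiplies the per-device batch before each optimizer step. A small sketch of the arithmetic (the `effective_batch_size` helper and the single-GPU default are illustrative assumptions, not part of the paper's code):

```python
# Subset of the paper's reported training_config.
config = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
}

def effective_batch_size(cfg: dict, num_gpus: int = 1) -> int:
    """Examples seen per optimizer step: per-device batch size
    times accumulation steps times the number of devices."""
    return (cfg["per_device_train_batch_size"]
            * cfg["gradient_accumulation_steps"]
            * num_gpus)
```

With the reported settings on a single A100, each optimizer step therefore consumes 8 training examples.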