RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
Authors: Qiyuan Zhang, Yufei Wang, Tiezheng Yu, Yuxin Jiang, Chuhan Wu, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that REVISEVAL outperforms traditional reference-free and reference-based evaluation paradigms that use LLM-as-a-Judge across NLG tasks and open-ended instruction-following tasks. More importantly, our response-adapted references can further boost the classical text metrics, e.g., BLEU and BERTScore, compared to traditional references and even rival the LLM-as-a-Judge. A detailed analysis is also conducted to confirm REVISEVAL's effectiveness in bias reduction, the impact of inference cost, and reference relevance. |
| Researcher Affiliation | Collaboration | 1City University of Hong Kong, 2Huawei Noah's Ark Lab, 3The Hong Kong University of Science and Technology (Guangzhou), 4McGill University & MILA |
| Pseudocode | No | The paper includes mathematical formulations (Equations 1-4) and prompt templates in the Appendix (Figures 4-10), but no structured pseudocode or algorithm blocks are explicitly labeled or presented as such in the main text or appendices. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it include a direct link to a code repository for the methodology described. It mentions open-source LLMs and datasets that were used (e.g., Llama 3.1-8B-Inst, MetricInstruct, hh-rlhf), but these are third-party resources, not the authors' implementation code. |
| Open Datasets | Yes | We evaluate our approach on multiple classic NLG benchmarks by measuring the correlation between the evaluators/metrics and human annotations in a scoring rating task. We follow the experimental setting of Jiang et al. (2024a) and select four representative NLG tasks and corresponding benchmarks: Data-to-Text (WebNLG), Machine Translation (WMT-22 (zh-en)), Text Summarization (SummEval), and Story Generation (OpenMEVA), and Table 10 shows the details of these benchmarks. Additionally, we test our approach on the more challenging open-ended instruction-following benchmarks (MT-Bench, AlpacaFarm, and LLMBar), which primarily rely on a pairwise comparison task. |
| Dataset Splits | Yes | We select the first round of dialogues from this dataset as the evaluation data, containing 1284 cases. (MT-Bench) We filtered out these tied cases, leaving a final evaluation dataset of 501 instances. (AlpacaFarm) The overall size is 419. (LLMBar) In the end, we selected 10,000 samples, each containing corresponding revisions, reference-free evaluation, and reference-based evaluation. These three sets were then used to train the same model separately, ensuring that no new information is introduced to alter the distribution. |
| Hardware Specification | No | The paper mentions using proprietary LLMs (GPT-4) and open-source LLMs (Llama 3.1-8B-Inst) for experiments and fine-tuning. However, it does not specify any particular hardware components such as GPU models, CPU types, or memory used for running these models or for the fine-tuning process. |
| Software Dependencies | No | The paper specifies the version of GPT-4 used for inference and distillation ('GPT-4 version GPT-4-TURBO-2024-04-09') and the base model for open-source fine-tuning ('LLAMA 3.1-8B-INST'). However, it does not list any other ancillary software dependencies like programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other solvers with specific version numbers. |
| Experiment Setup | Yes | For reproducibility, we used the GPT-4 version GPT-4-TURBO-2024-04-09, with a temperature setting of 0.0. Training Setting. We followed the common setup for supervised instruction finetuning, with a context length = 2048, epochs = 3, batch size = 128, and learning rate = 2e-5. Decoding Setting. When the finetuned model executes the evaluation and revision tasks, the decoding setting uses a greedy decoding strategy with a max output length = 1024 and temperature = 0.01. |
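For readers attempting to re-run the setup, the hyperparameters reported in the Experiment Setup row can be collected into a small configuration sketch. This is illustrative only: the key names and the `summarize` helper are assumptions of this review, not the authors' code (which is not released).

```python
# Configuration sketch reconstructed from the paper's reported settings.
# Key names are illustrative assumptions; no official implementation exists.

# Judge/distillation model settings (inference).
JUDGE_CONFIG = {
    "model": "gpt-4-turbo-2024-04-09",
    "temperature": 0.0,
}

# Supervised instruction fine-tuning of the open-source model.
TRAINING_CONFIG = {
    "base_model": "Llama-3.1-8B-Instruct",
    "context_length": 2048,
    "epochs": 3,
    "batch_size": 128,
    "learning_rate": 2e-5,
}

# Decoding settings when the fine-tuned model runs evaluation/revision.
DECODING_CONFIG = {
    "strategy": "greedy",
    "max_output_length": 1024,
    "temperature": 0.01,
}


def summarize(config: dict) -> str:
    """Render a config as comma-separated key=value pairs for logging."""
    return ", ".join(f"{k}={v}" for k, v in config.items())


if __name__ == "__main__":
    for name, cfg in [("judge", JUDGE_CONFIG),
                      ("training", TRAINING_CONFIG),
                      ("decoding", DECODING_CONFIG)]:
        print(f"{name}: {summarize(cfg)}")
```

Collecting the settings in one place makes it easy to spot the one reproducibility gap the table notes: hardware and software dependencies are unspecified, so only these hyperparameters are pinned.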