RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
Authors: Qiyuan Zhang, Yufei Wang, Tiezheng Yu, Yuxin Jiang, Chuhan Wu, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that REVISEVAL outperforms traditional reference-free and reference-based evaluation paradigms that use LLM-as-a-Judge across NLG tasks and open-ended instruction-following tasks. More importantly, our response-adapted references can further boost the classical text metrics, e.g., BLEU and BERTScore, compared to traditional references and even rival the LLM-as-a-Judge. A detailed analysis is also conducted to confirm REVISEVAL's effectiveness in bias reduction, the impact of inference cost, and reference relevance. |
| Researcher Affiliation | Collaboration | 1City University of Hong Kong, 2Huawei Noah's Ark Lab, 3The Hong Kong University of Science and Technology (Guangzhou), 4McGill University & MILA |
| Pseudocode | No | The paper includes mathematical formulations (Equations 1-4) and prompt templates in the Appendix (Figures 4-10), but no structured pseudocode or algorithm blocks are explicitly labeled or presented as such in the main text or appendices. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it include a direct link to a code repository for the methodology described. It mentions open-source LLMs and datasets that were used (e.g., Llama 3.1-8B-Inst, MetricInstruct, hh-rlhf), but these are third-party resources, not the authors' implementation code. |
| Open Datasets | Yes | We evaluate our approach on multiple classic NLG benchmarks by measuring the correlation between the evaluators/metrics and human annotations in a scoring rating task. We follow the experimental setting of Jiang et al. (2024a) and select four representative NLG tasks and corresponding benchmarks: Data-to-Text (WebNLG), Machine Translation (WMT-22 (zh-en)), Text Summarization (SummEval), and Story Generation (OpenMEVA), and Table 10 shows the details of these benchmarks. Additionally, we test our approach on the more challenging open-ended instruction-following benchmarks (MT-Bench, AlpacaFarm, and LLMBar), which primarily rely on a pairwise comparison task. |
| Dataset Splits | Yes | We select the first round of dialogues from this dataset as the evaluation data, containing 1284 cases. (MT-Bench) We filtered out these tied cases, leaving a final evaluation dataset of 501 instances. (AlpacaFarm) The overall size is 419. (LLMBar) In the end, we selected 10,000 samples, each containing corresponding revisions, reference-free evaluation, and reference-based evaluation. These three sets were then used to train the same model separately, ensuring that no new information is introduced to alter the distribution. |
| Hardware Specification | No | The paper mentions using proprietary LLMs (GPT-4) and open-source LLMs (Llama 3.1-8B-Inst) for experiments and fine-tuning. However, it does not specify any particular hardware components such as GPU models, CPU types, or memory used for running these models or for the fine-tuning process. |
| Software Dependencies | No | The paper specifies the version of GPT-4 used for inference and distillation ('GPT-4 version GPT-4-TURBO-2024-04-09') and the base model for open-source fine-tuning ('LLAMA 3.1-8B-INST'). However, it does not list any other ancillary software dependencies like programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other solvers with specific version numbers. |
| Experiment Setup | Yes | For reproducibility, we used the GPT-4 version GPT-4-TURBO-2024-04-09, with a temperature setting of 0.0. Training Setting. We followed the common setup for supervised instruction finetuning, with a context length = 2048, epochs = 3, batch size = 128, and learning rate = 2e-5. Decoding Setting. When the finetuned model executes the evaluation and revision tasks, the decoding setting uses a greedy decoding strategy with a max output length = 1024 and temperature = 0.01. |
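For readers attempting to re-run the setup, the hyperparameters reported in the Experiment Setup row can be collected into a small configuration sketch. This is illustrative only: the key names and the `summarize` helper are assumptions of this review, not the authors' code (which is not released).

```python
# Configuration sketch reconstructed from the paper's reported settings.
# Key names are illustrative assumptions; no official implementation exists.

# Judge/distillation model settings (inference).
JUDGE_CONFIG = {
    "model": "gpt-4-turbo-2024-04-09",
    "temperature": 0.0,
}

# Supervised instruction fine-tuning of the open-source model.
TRAINING_CONFIG = {
    "base_model": "Llama-3.1-8B-Instruct",
    "context_length": 2048,
    "epochs": 3,
    "batch_size": 128,
    "learning_rate": 2e-5,
}

# Decoding settings when the fine-tuned model runs evaluation/revision.
DECODING_CONFIG = {
    "strategy": "greedy",
    "max_output_length": 1024,
    "temperature": 0.01,
}


def summarize(config: dict) -> str:
    """Render a config as comma-separated key=value pairs for logging."""
    return ", ".join(f"{k}={v}" for k, v in config.items())


if __name__ == "__main__":
    for name, cfg in [("judge", JUDGE_CONFIG),
                      ("training", TRAINING_CONFIG),
                      ("decoding", DECODING_CONFIG)]:
        print(f"{name}: {summarize(cfg)}")
```

Collecting the settings in one place makes it easy to spot the one reproducibility gap the table notes: hardware and software dependencies are unspecified, so only these hyperparameters are pinned.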