xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation

Authors: Qingchen Yu, Zifan Zheng, Shichao Song, Zhiyu Li, Feiyu Xiong, Bo Tang, Ding Chen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Generalization tests and real-world evaluations show that the smallest xFinder model, with only 500 million parameters, achieves an average extraction accuracy of 93.42%. In contrast, RegEx accuracy in the best evaluation framework is 74.38%. The final judgment accuracy of xFinder reaches 97.61%, outperforming existing evaluation frameworks and judge models. All resources for xFinder are available at https://github.com/IAAR-Shanghai/xFinder.
Researcher Affiliation | Academia | 1 Institute for Advanced Algorithms Research, Shanghai; 2 Renmin University of China
Pseudocode | No | The paper describes a novel evaluator and its methodology, including dataset construction and model training. It provides a schematic diagram in Figure 3, but no structured pseudocode or algorithm blocks are present in the text.
Open Source Code | Yes | All resources for xFinder are available at https://github.com/IAAR-Shanghai/xFinder.
Open Datasets | Yes | To address these issues, we propose xFinder, a novel evaluator for answer extraction and matching in LLM evaluation. As part of this process, we create a specialized dataset, the Key Answer Finder (KAF) dataset, to ensure effective model training and evaluation. ... All resources for xFinder are available at https://github.com/IAAR-Shanghai/xFinder.
Dataset Splits | Yes | The KAF dataset encompasses a variety of evaluation tasks, including questions, optional answer ranges, LLM responses to the questions, and the extracted key answers. The dataset is divided into three segments: training, test, and generalization sets, used for fine-tuning, testing, and performance assessment, respectively. The training set has 26,900 samples, the test set 4,961 samples, and the generalization set 4,482 samples.
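The reported split sizes can be sanity-checked with a few lines of arithmetic; the totals and proportions below are derived directly from the sample counts quoted above:

```python
# KAF dataset split sizes as reported in the paper.
splits = {"training": 26900, "test": 4961, "generalization": 4482}

total = sum(splits.values())  # overall number of KAF samples
shares = {name: n / total for name, n in splits.items()}

print(total)                                       # 36343
print({k: round(v, 3) for k, v in shares.items()}) # roughly 74% / 14% / 12%
```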
Hardware Specification | Yes | We randomly sampled 200 instances from four question types within the Generalization Set. As shown in Table 5, we present the time required to evaluate these tasks on the same machine equipped with 8 NVIDIA H100 (80G) GPUs. Notably, JudgeLM-33B utilized 2 GPUs, while the other models operated on a single GPU, with all evaluations conducted in a single process. ... The training was conducted on 8x A100 GPUs.
Software Dependencies | No | During the training process, we primarily utilized the XTuner (InternLM, 2023) tool developed by the InternLM team for fine-tuning. ... We employed the QLoRA method from the XTuner framework (InternLM, 2023; Dettmers et al., 2024) for fine-tuning the foundation models... The paper mentions XTuner but does not specify a version number.
Experiment Setup | Yes | When fine-tuning xFinder using the XTuner framework and the QLoRA method, all base models were trained with identical hyperparameters. The training was conducted on 8x A100 GPUs. We focused on the hyperparameters specified in Table 21, while all other hyperparameters were set to their default values. Table 21: xFinder fine-tuning hyperparameter settings: Batch Size 1; Maximum Length 2048; Learning Rate 2e-4; Pack to Max Length True; Optimizer Type AdamW; Betas (0.9, 0.999); Weight Decay 0; Warmup Ratio 0.03.
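The quoted hyperparameters can be collected into a single config fragment. This is a sketch mirroring Table 21 only: XTuner's actual config files are Python modules with many more fields, and the key names here are illustrative, not XTuner's exact schema.

```python
# Fine-tuning hyperparameters from Table 21 of the paper; all other
# XTuner/QLoRA settings were left at their defaults. Key names are
# illustrative, not XTuner's real config schema.
finetune_config = {
    "batch_size": 1,
    "max_length": 2048,
    "learning_rate": 2e-4,
    "pack_to_max_length": True,
    "optimizer": "AdamW",
    "betas": (0.9, 0.999),
    "weight_decay": 0.0,
    "warmup_ratio": 0.03,
}
```

Packing sequences to the maximum length with a batch size of 1 is a common memory-efficient setup for QLoRA fine-tuning on long-context instruction data.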