TreeEval: Benchmark-Free Evaluation of Large Language Models through Tree Planning

Authors: Xiang Li, Yunshi Lan, Chao Yang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate 6 models of different parameter sizes, including 7B, 13B, and 33B, and ultimately achieved the highest correlation coefficient with AlpacaEval 2.0 using only around 45 questions. We also conduct more analysis to show the robustness and reliability of TreeEval."
Researcher Affiliation | Collaboration | 1 East China Normal University, 2 Shanghai AI Laboratory. EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Procedure of TreeEval
Open Source Code | Yes | Code: https://github.com/Ashura5/TreeEval
Open Datasets | Yes | Benchmark paradigm: MMLU (Hendrycks et al. 2021), Big-Bench Hard (BBH) (Suzgun et al. 2022). LLMs as judges: AlpacaEval and AlpacaEval 2.0 (Li et al. 2023b), MT-Bench (Zheng et al. 2023a)
Dataset Splits | No | The paper does not provide dataset split information (exact percentages, sample counts, or splitting methodology) for the datasets used or generated during evaluation. It mentions "5-shot and 3-shot contexts", but these refer to evaluation settings rather than dataset splits, and it cites standard benchmarks without detailing their internal splits for reproducibility.
Hardware Specification | No | The paper does not report the hardware used for its experiments (GPU/CPU models, processor speeds, or memory amounts). It mentions using "GPT-4-0613 as the examiner, deployed with FastChat", but gives no hardware specifications.
Software Dependencies | No | The paper names "GPT-4-0613 as the examiner", which pins the model version, and states that it is "deployed with FastChat", but no FastChat version is given, nor versions for other key software components.
Experiment Setup | Yes | "We use GPT-4-0613 as the examiner, deployed with FastChat (Zheng et al. 2023a), with a temperature of 1 for varied question generation. We set T and k to 3, and α, β, and γ to 1, 1, and 0.4, respectively."
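The reported setup can be collected into a small configuration sketch. This is a hypothetical summary for reference, not code from the TreeEval repository; the dictionary keys are assumed names, while the values are exactly those stated in the paper.

```python
# Hypothetical configuration sketch of the TreeEval experiment setup.
# Key names are assumptions; values come from the paper's reported settings.
treeeval_config = {
    "examiner_model": "gpt-4-0613",   # LLM used as the examiner
    "serving_framework": "FastChat",  # deployment stack (Zheng et al. 2023a)
    "temperature": 1.0,               # sampling temperature for varied question generation
    "T": 3,                           # set to 3 in the paper
    "k": 3,                           # set to 3 in the paper
    "alpha": 1.0,                     # α
    "beta": 1.0,                      # β
    "gamma": 0.4,                     # γ
}

print(treeeval_config["examiner_model"], treeeval_config["gamma"])
```

A structure like this makes the reproducibility gap above concrete: every value here is pinned by the paper, but the FastChat version and hardware environment would still have to be added before the setup is fully reproducible.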