TreeEval: Benchmark-Free Evaluation of Large Language Models through Tree Planning

Authors: Xiang Li, Yunshi Lan, Chao Yang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate 6 models of different parameter sizes, including 7B, 13B, and 33B, and ultimately achieved the highest correlation coefficient with AlpacaEval 2.0 using only around 45 questions. We also conduct more analysis to show the robustness and reliability of TreeEval."
Researcher Affiliation | Collaboration | 1 East China Normal University, 2 Shanghai AI Laboratory. EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: Procedure of TreeEval
Open Source Code | Yes | Code: https://github.com/Ashura5/TreeEval
Open Datasets | Yes | Benchmark paradigm: MMLU (Hendrycks et al. 2021), Big-Bench Hard (BBH) (Suzgun et al. 2022). LLMs as judges: AlpacaEval and AlpacaEval 2.0 (Li et al. 2023b), MT-Bench (Zheng et al. 2023a)
Dataset Splits | No | The paper does not provide dataset split information (exact percentages, sample counts, or splitting methodology) for the datasets used or generated during evaluation. It mentions "5-shot and 3-shot contexts", but these refer to evaluation settings rather than dataset splits, and it cites standard benchmarks without detailing their internal splits for reproducibility.
Hardware Specification | No | The paper does not report the hardware used for its experiments (GPU/CPU models, processor speeds, or memory amounts). It mentions using "GPT-4-0613 as the examiner, deployed with FastChat", but gives no hardware specifications.
Software Dependencies | No | The paper names "GPT-4-0613 as the examiner", which pins the model version, and states that it is "deployed with FastChat", but no FastChat version is given, nor versions for other key software components.
Experiment Setup | Yes | "We use GPT-4-0613 as the examiner, deployed with FastChat (Zheng et al. 2023a), with a temperature of 1 for varied question generation. We set T and k to 3, and α, β, and γ to 1, 1, and 0.4, respectively."
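The reported setup can be collected into a small configuration sketch. This is a hypothetical summary for reference, not code from the TreeEval repository; the dictionary keys are assumed names, while the values are exactly those stated in the paper.

```python
# Hypothetical configuration sketch of the TreeEval experiment setup.
# Key names are assumptions; values come from the paper's reported settings.
treeeval_config = {
    "examiner_model": "gpt-4-0613",   # LLM used as the examiner
    "serving_framework": "FastChat",  # deployment stack (Zheng et al. 2023a)
    "temperature": 1.0,               # sampling temperature for varied question generation
    "T": 3,                           # set to 3 in the paper
    "k": 3,                           # set to 3 in the paper
    "alpha": 1.0,                     # α
    "beta": 1.0,                      # β
    "gamma": 0.4,                     # γ
}

print(treeeval_config["examiner_model"], treeeval_config["gamma"])
```

A structure like this makes the reproducibility gap above concrete: every value here is pinned by the paper, but the FastChat version and hardware environment would still have to be added before the setup is fully reproducible.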