TreeEval: Benchmark-Free Evaluation of Large Language Models through Tree Planning
Authors: Xiang Li, Yunshi Lan, Chao Yang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate 6 models of different parameter sizes, including 7B, 13B, and 33B, and ultimately achieve the highest correlation coefficient with AlpacaEval 2.0 using only around 45 questions. We also conduct further analysis to show the robustness and reliability of TreeEval. |
| Researcher Affiliation | Collaboration | ¹East China Normal University, ²Shanghai AI Laboratory. EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Procedure of TreeEval |
| Open Source Code | Yes | Code: https://github.com/Ashura5/TreeEval |
| Open Datasets | Yes | Benchmark Paradigm: MMLU (Hendrycks et al. 2021), Big-Bench Hard (BBH) (Suzgun et al. 2022); LLMs as Judges: AlpacaEval and AlpacaEval 2.0 (Li et al. 2023b), MT-Bench (Zheng et al. 2023a) |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, or detailed splitting methodology) for the datasets used or generated during the evaluation process. It mentions '5-shot and 3-shot contexts' which refer to evaluation settings, not dataset splits, and refers to standard benchmarks without detailing their internal splits for reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed machine specifications) used for running its experiments. It mentions using 'GPT-4-0613 as the examiner, deployed with FastChat', but gives no hardware specifications. |
| Software Dependencies | No | The paper specifies the examiner model version ('GPT-4-0613') and states it is 'deployed with FastChat', but no FastChat version is provided, nor are versions for other key software components. |
| Experiment Setup | Yes | We use GPT-4-0613 as the examiner, deployed with FastChat (Zheng et al. 2023a), with a temperature of 1 for varied question generation. We set T and k to 3, and α, β, and γ to 1, 1, and 0.4, respectively. |
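The Experiment Setup row lists the concrete hyperparameters a reproduction would need. A minimal sketch of how they might be collected into a single configuration object is below; the class and field names (`TreeEvalConfig`, `max_depth_T`, `children_k`) are hypothetical, chosen here for illustration — only the values (temperature 1, T = k = 3, α = β = 1, γ = 0.4, GPT-4-0613 examiner) come from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TreeEvalConfig:
    """Hypothetical container for the hyperparameters reported in the paper."""
    examiner_model: str = "gpt-4-0613"  # examiner LLM, served via FastChat
    temperature: float = 1.0            # sampling temperature for varied question generation
    max_depth_T: int = 3                # T: tree-planning depth
    children_k: int = 3                 # k: child questions per node
    alpha: float = 1.0                  # α weight
    beta: float = 1.0                   # β weight
    gamma: float = 0.4                  # γ weight

cfg = TreeEvalConfig()
print(cfg.alpha, cfg.beta, cfg.gamma)
```

A frozen dataclass keeps the reported settings immutable and in one place, so a reproduction script can log the exact configuration alongside its results.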