BenTo: Benchmark Reduction with In-Context Transferability
Authors: Hongyu Zhao, Ming Li, Lichao Sun, Tianyi Zhou
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluating large language models (LLMs) is costly: it requires the generation and examination of LLM outputs on a large-scale benchmark of various tasks. This paper investigates how to efficiently reduce the tasks used to benchmark LLMs without affecting the evaluation quality. Our study reveals that task transferability and relevance provide critical information to identify the most representative subset of tasks via optimizing a facility location function. We propose a practically efficient metric for estimating the transferability between two tasks via in-context learning (ICL). By analyzing the pairwise transferability, we can reduce the tasks in a modern LLM benchmark (e.g., MMLU or FLAN) to 5% while inducing only a <4% difference from the evaluation on the original benchmark. Extensive experiments are conducted to evaluate the effectiveness of the BENTO-reduced benchmark by comparing the performance of several widely used LLMs on both the reduced and original benchmarks. Remarkably, as is shown in Figure 1 (RIGHT) and Table 1, the results are highly consistent, even though the reduced benchmark comprises only 5% of the original tasks. |
| Researcher Affiliation | Academia | Hongyu Zhao1, Ming Li1, Lichao Sun2, Tianyi Zhou1 1University of Maryland, College Park 2Lehigh University |
| Pseudocode | Yes | Appendix A (Algorithm): Our method BENTO is described in Algorithm 1 (Benchmark Task Reduction, BENTO). |
| Open Source Code | Yes | Project: https://github.com/tianyi-lab/bento |
| Open Datasets | Yes | Benchmarks. We assess our method mainly on two benchmarks: MMLU (Hendrycks et al., 2021a;b) and FLAN (Wei et al., 2021). ... AGIEval (Zhong et al., 2023) is a question-answering dataset... Big-Bench Hard (Suzgun et al., 2022) is a benchmark with 27 subtasks... |
| Dataset Splits | No | Benchmarks. We assess our method mainly on two benchmarks: MMLU (Hendrycks et al., 2021a;b) and FLAN (Wei et al., 2021). MMLU is a question-answering dataset containing 57 tasks with diverse subjects and levels of difficulty, mainly focusing on the knowledge of the language models. All the questions in MMLU are multiple-choice questions with 4 options for each answer. On MMLU, we use accuracy (ACC) as the evaluation metric s(·, ·) in Equation (1). FLAN is a dataset with more diverse forms of tasks, including free-form generation tasks like translation and summarization. Since FLAN is a large dataset, we sampled 100 questions from each of its 66 tasks as our whole benchmark. On FLAN, we use response perplexity as the evaluation metric s(·, ·). We follow widely used prompts for these benchmarks without any re-engineering; more details in Appendix D. ... On this dataset, we only use the 3 to 5 examples in the training set as the exemplars. |
| Hardware Specification | Yes | For our main experiments, we use 4 A100 40G GPUs for about 3 days. |
| Software Dependencies | No | Our implementation is based on Klein et al. (2018). |
| Experiment Setup | Yes | For our main experiments, we use L = 5 exemplars and M = 10 random seeds. We set nj to be a large value so that we always evaluate on the whole test set. ... ICL is performed on Llama-2-13B (Touvron et al., 2023) and Llama-2-7B to estimate ICT for MMLU tasks and FLAN tasks, respectively. ... In both datasets, we apply all methods to select k representative tasks with k from 1 to K (We pick K to be approximately 18% of the tasks, i.e., 10 on MMLU and 12 on FLAN). ... On MMLU, we use the following prompts: The following are multiple choice questions (with answers) about [Task A's subject]. [Task A's exemplars][Task B's question] Answer: An exemplar has the format: [Question] Answer: [Answer] On FLAN, we use the following prompts: You are a helpful AI assistant. Here are some example input-output pairs that you should follow. [Task A's exemplars]Input: [Task B's question] Output: An exemplar has the format: Input: [Question] Output: [Answer] |
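The setup above selects k representative tasks by optimizing a facility location function over a pairwise transferability matrix. As a minimal sketch of how such a selection could work, the following shows the standard greedy maximization of a facility location objective; `select_tasks`, the coverage semantics of `T`, and the assumption that scores are nonnegative are illustrative choices here, not the authors' released implementation (see their repository for the actual Algorithm 1).

```python
import numpy as np

def select_tasks(T, k):
    """Greedy facility-location selection (illustrative sketch).

    T : (n, n) array where T[i, j] estimates how well task j "covers"
        task i, e.g. an in-context transferability (ICT) score.
        Assumed nonnegative (e.g. normalized to [0, 1]).
    k : number of representative tasks to keep.
    Returns the list of selected task indices.
    """
    n = T.shape[0]
    selected = []
    # coverage[i] = best transfer to task i from any task selected so far
    coverage = np.zeros(n)
    for _ in range(k):
        # marginal gain of adding candidate j to the selected set
        gains = np.array([
            np.maximum(coverage, T[:, j]).sum() - coverage.sum()
            if j not in selected else -np.inf
            for j in range(n)
        ])
        j = int(gains.argmax())
        selected.append(j)
        coverage = np.maximum(coverage, T[:, j])
    return selected
```

Because the facility location function is monotone submodular, this greedy procedure carries the usual (1 - 1/e) approximation guarantee, which is what makes reducing a 57- or 66-task benchmark to ~5% of its tasks tractable without exhaustive subset search.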