BenTo: Benchmark Reduction with In-Context Transferability
Authors: Hongyu Zhao, Ming Li, Lichao Sun, Tianyi Zhou
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluating large language models (LLMs) is costly: it requires the generation and examination of LLM outputs on a large-scale benchmark of various tasks. This paper investigates how to efficiently reduce the tasks used to benchmark LLMs without affecting the evaluation quality. Our study reveals that task transferability and relevance provide critical information to identify the most representative subset of tasks via optimizing a facility location function. We propose a practically efficient metric for estimating the transferability between two tasks via in-context learning (ICL). By analyzing the pairwise transferability, we can reduce the tasks in a modern LLM benchmark (e.g., MMLU or FLAN) to 5% while inducing only a <4% difference from the evaluation on the original benchmark. Extensive experiments are conducted to evaluate the effectiveness of the BENTO-reduced benchmark by comparing the performance of several widely used LLMs on both the reduced and original benchmarks. Remarkably, as is shown in Figure 1 (RIGHT) and Table 1, the results are highly consistent, even though the reduced benchmark comprises only 5% of the original tasks. |
| Researcher Affiliation | Academia | Hongyu Zhao1, Ming Li1, Lichao Sun2, Tianyi Zhou1 1University of Maryland, College Park 2Lehigh University |
| Pseudocode | Yes | Appendix A (Algorithm): Our method BENTO is described in Algorithm 1 (Benchmark Task Reduction, BENTO). |
| Open Source Code | Yes | Project: https://github.com/tianyi-lab/bento |
| Open Datasets | Yes | Benchmarks. We assess our method mainly on two benchmarks: MMLU (Hendrycks et al., 2021a;b) and FLAN (Wei et al., 2021). ... AGIEval (Zhong et al., 2023) is a question-answering dataset... Big-Bench Hard (Suzgun et al., 2022) is a benchmark with 27 subtasks... |
| Dataset Splits | No | Benchmarks. We assess our method mainly on two benchmarks: MMLU (Hendrycks et al., 2021a;b) and FLAN (Wei et al., 2021). MMLU is a question-answering dataset containing 57 tasks with diverse subjects and levels of difficulty, mainly focusing on the knowledge of the language models. All the questions in MMLU are multiple-choice questions with 4 options for each answer. On MMLU, we use accuracy (ACC) as the evaluation metric s(·, ·) in Equation (1). FLAN is a dataset with more diverse forms of tasks, including free-form generation tasks like translation and summarization. Since FLAN is a large dataset, we sampled 100 questions from each of its 66 tasks as our whole benchmark. On FLAN, we use response perplexity as the evaluation metric s(·, ·). We follow widely used prompts for these benchmarks without any re-engineering; more details in Appendix D. ... On this dataset, we only use the 3 to 5 examples in the training set as the exemplars. |
| Hardware Specification | Yes | For our main experiments, we use 4 A100 40G GPUs for about 3 days. |
| Software Dependencies | No | Our implementation is based on Klein et al. (2018). |
| Experiment Setup | Yes | For our main experiments, we use L = 5 exemplars and M = 10 random seeds. We set nj to be a large value so that we always evaluate on the whole test set. ... ICL is performed on Llama-2-13B (Touvron et al., 2023) and Llama-2-7B to estimate ICT for MMLU tasks and FLAN tasks, respectively. ... In both datasets, we apply all methods to select k representative tasks with k from 1 to K (We pick K to be approximately 18% of the tasks, i.e., 10 on MMLU and 12 on FLAN). ... On MMLU, we use the following prompts: The following are multiple choice questions (with answers) about [Task A's subject]. [Task A's exemplars][Task B's question] Answer: An exemplar has the format: [Question] Answer: [Answer] On FLAN, we use the following prompts: You are a helpful AI assistant. Here are some example input-output pairs that you should follow. [Task A's exemplars]Input: [Task B's question] Output: An exemplar has the format: Input: [Question] Output: [Answer] |
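The setup above selects k representative tasks by optimizing a facility location function over a pairwise transferability matrix. As a minimal sketch of how such a selection could work, the following shows the standard greedy maximization of a facility location objective; `select_tasks`, the coverage semantics of `T`, and the assumption that scores are nonnegative are illustrative choices here, not the authors' released implementation (see their repository for the actual Algorithm 1).

```python
import numpy as np

def select_tasks(T, k):
    """Greedy facility-location selection (illustrative sketch).

    T : (n, n) array where T[i, j] estimates how well task j "covers"
        task i, e.g. an in-context transferability (ICT) score.
        Assumed nonnegative (e.g. normalized to [0, 1]).
    k : number of representative tasks to keep.
    Returns the list of selected task indices.
    """
    n = T.shape[0]
    selected = []
    # coverage[i] = best transfer to task i from any task selected so far
    coverage = np.zeros(n)
    for _ in range(k):
        # marginal gain of adding candidate j to the selected set
        gains = np.array([
            np.maximum(coverage, T[:, j]).sum() - coverage.sum()
            if j not in selected else -np.inf
            for j in range(n)
        ])
        j = int(gains.argmax())
        selected.append(j)
        coverage = np.maximum(coverage, T[:, j])
    return selected
```

Because the facility location function is monotone submodular, this greedy procedure carries the usual (1 - 1/e) approximation guarantee, which is what makes reducing a 57- or 66-task benchmark to ~5% of its tasks tractable without exhaustive subset search.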