On Speeding Up Language Model Evaluation

Authors: Jin Zhou, Christian Belardi, Ruihan Wu, Travis Zhang, Carla Gomes, Wen Sun, Kilian Weinberger

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the efficacy of these two algorithms, in addition to common non-active baselines, in a number of practical settings. Our empirical analysis shows that our two active selection algorithms substantially outperform all baselines. Notably, UCB-E works best in easier settings, while UCB-E-LRF shines in harder settings, where the performance differences between methods are more subtle. In Section 4, 'EXPERIMENTS', the paper details evaluations on various datasets, using metrics such as 'Top-1 Precision' and 'NDCG@K', and performs 'Ablations of the proposed algorithms and hyperparameters'.
Researcher Affiliation | Academia | Jin Peng Zhou (Cornell University), Christian K. Belardi (Cornell University), Ruihan Wu (University of California, San Diego), Travis Zhang (Cornell University), Carla P. Gomes (Cornell University), Wen Sun (Cornell University), Kilian Q. Weinberger (Cornell University). All authors are affiliated with universities (Cornell University and University of California, San Diego), and the provided email addresses (EMAIL and EMAIL) also indicate academic affiliations.
Pseudocode | Yes | The paper includes several clearly labeled algorithm blocks: "Algorithm 1 UCB-E", "Algorithm 2 UCB-E WITH LOW-RANK FACTORIZATION (UCB-E-LRF)", "Algorithm 3 Row Mean Imputation", "Algorithm 4 Filled Subset", "Algorithm 5 LRF", and "Algorithm 6 UCB-E-LRF (Score Only)".
Open Source Code | Yes | Our code is available at https://github.com/kilian-group/banditeval. Section 7, 'REPRODUCIBILITY STATEMENT', reiterates: 'Our code is available at https://github.com/kilian-group/banditeval'.
Open Datasets | Yes | To assess the performance of our algorithms under a variety of use cases, we test with three datasets: AlpacaEval (Li et al., 2023), Grade School Math 8K (GSM8K) (Cobbe et al., 2021), and Physical Interaction: Question Answering (PIQA) (Bisk et al., 2020), together with different settings of the method set F and the scoring function s; Table 2 summarizes each dataset.
Dataset Splits | Yes | For GSM8K Prompts and PIQA Prompts... We evaluate the prompts on 784 and 1546 questions from the training set (about 10% of the size of the two training sets). For GSM8K Models and PIQA Models... For the examples, we randomly select 1000 questions from each dataset. Our algorithm UCB-E-LRF by default randomly evaluates 5% of the method-example pairs in the data matrix before selecting actively.
Hardware Specification | Yes | 78 Nvidia A6000 GPU hours are needed to evaluate 205 zero-shot prompts on 784 GSM8K (Cobbe et al., 2021) questions using Mistral-7B (Jiang et al., 2023), Figure 1 (right). Depending on the dataset, some of these configurations experienced out-of-memory errors on an Nvidia 3090 when we collected our data; we drop these configurations to simulate real-world scenarios.
Software Dependencies | No | The paper mentions various LLMs and models used, such as Mistral-7B, Tulu-7B, GPT-2, CodeLlama, Gemma-7B, Phi-2, Llemma-7B, LLaMA-2-7B, and StarCoder-7B. However, it does not specify software dependencies (e.g., programming languages, libraries, frameworks) with version numbers that would be required to reproduce the experimental setup.
Experiment Setup | Yes | UCB-E: We use a = 1 since this consistently yields the best performance across all datasets. UCB-E-LRF: For all datasets, we use rank r = 1 for low-rank factorization with an ensemble size of C = 64. We use 5% of the data for warm-up, i.e., T0 = 0.05mn, and η = 5. Finally, to take advantage of the parallelism available in modern computing hardware, all algorithms and baselines are implemented with a batch size b. ... We use b = 32 for all experiments unless explicitly stated otherwise.
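To make the setup above concrete, here is a minimal sketch of a UCB-E-style selection loop in the spirit of the paper's Algorithm 1, using the reported exploration parameter a = 1. This is an illustration, not the authors' implementation (their code is at the linked repository): the oracle `score_fn`, the warm-up scheme, and all variable names are assumptions made for this sketch, and batching and the low-rank ensemble are omitted.

```python
import numpy as np

def ucb_e(score_fn, n_methods, n_examples, budget, a=1.0, seed=0):
    """Sketch of UCB-E best-method identification under a fixed budget.

    score_fn(i, j) is a hypothetical oracle returning method i's score
    on example j (e.g., whether prompt i solves GSM8K question j).
    """
    rng = np.random.default_rng(seed)
    sums = np.zeros(n_methods)    # running sum of observed scores per method
    counts = np.zeros(n_methods)  # number of evaluations per method

    # Warm-up: evaluate each method once so every empirical mean is defined.
    for i in range(n_methods):
        j = rng.integers(n_examples)
        sums[i] += score_fn(i, j)
        counts[i] += 1

    # Spend the remaining budget on the method with the highest
    # upper confidence bound: empirical mean + sqrt(a / count).
    for _ in range(budget - n_methods):
        ucb = sums / counts + np.sqrt(a / counts)
        i = int(np.argmax(ucb))
        j = rng.integers(n_examples)
        sums[i] += score_fn(i, j)
        counts[i] += 1

    # Return the method with the best empirical mean.
    return int(np.argmax(sums / counts))
```

The exploration bonus shrinks as a method accumulates evaluations, so the budget concentrates on methods that still plausibly rank first, which is what lets the active algorithms identify the top method with only a fraction of the full evaluation matrix.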