Sample Efficient Demonstration Selection for In-Context Learning
Authors: Kiran Purohit, Venktesh V, Sourangshu Bhattacharya, Avishek Anand
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4. Experiments and Results We aim to address the following research questions: RQ I. How sample efficient is CASE compared to state-of-the-art exemplar selection and stochastic linear bandit methods? RQ II. Can CASE choose task-level demonstration samples without sacrificing task performance, compared to state-of-the-art exemplar selection methods? RQ III. Can CASE work without exploration? 4.1. Experimental Setup Synthetic Experiments For synthetic experiments, we adopt a setup similar to (Réda et al., 2021) and present results on simulated data. ... Datasets and Metrics: We evaluate on well-known datasets that require numerical and commonsense reasoning. ... We report performance using the official metrics: Exact Match (EM) and Cover-EM ... Baselines: We compare against instance-level exemplar selection methods... |
| Researcher Affiliation | Academia | *Equal contribution. 1Indian Institute of Technology, Kharagpur, India 2Delft University of Technology (TU Delft), Netherlands. |
| Pseudocode | Yes | Algorithm 1 describes the proposed challenger-arm sampling based exploration technique, called CASE. ... Algorithm 1 CASE 1: Input: X: set of all training exemplars, k: prompt size, S: all k-subsets of X, a ∈ S: an arm or k-subset 2: Define: Ut: set of currently estimated top-m arms. ... 26: Output: UT: Set of m arms which have the highest reward |
| Open Source Code | Yes | We release our code and data: https://github.com/kiranpurohit/CASE |
| Open Datasets | Yes | Datasets and Metrics: We evaluate on well-known datasets that require numerical and commonsense reasoning. For numerical reasoning, we use GSM8K, FinQA, TabMWP, and AQuA-RAT; for commonsense reasoning, we use StrategyQA. Detailed descriptions of the datasets are provided in Appendix A.3. ... Table 4: Overview of the Complex QA datasets used in this study. |
| Dataset Splits | Yes | Detailed descriptions of the datasets are provided in Appendix A.3. ... FinQA: This dataset has 1147 questions in the evaluation set. ... AQuA-RAT: 100,000 algebraic word problems in the train set, with dev and test sets each comprising 254 problems. ... StrategyQA: we perform stratified sampling on the full train set of 2,290 examples to split it into 1,800 train and 490 test examples. |
| Hardware Specification | No | The paper mentions LLMs like Mistral-7b, Llama2-7b, gpt-3.5-turbo, and gpt-4o-mini as models being used or evaluated, but it does not specify the underlying hardware (e.g., GPU models, CPU types, or memory) on which these models were run for the experiments. |
| Software Dependencies | No | The paper mentions using 'Sentence BERT' and large language models (LLMs) like GPT-3.5-turbo, GPT-4o-mini, Mistral-7b, and Llama2-7b, but it does not provide specific version numbers for any software, libraries, or frameworks used in the implementation. |
| Experiment Setup | Yes | Hyperparameters (LLMs): For LLMs, we set the temperature to 0.3 to reduce randomness. To reduce repetition, we apply a presence penalty of 0.6 and a frequency penalty of 0.8. The max_length for generation is set to 900. Hyperparameters (CASE): ... We set m = |Ut| = 10 to identify the top scoring subsets (arms) and choose a challenger set Nt of size 5. We set the number of validation examples V to 20 and ϵ = 0.1. |
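The Pseudocode and Experiment Setup rows above describe Algorithm 1 (CASE): arms are k-subsets of the exemplar pool, a top-m set is maintained, and a small challenger set of arms is sampled each round, with ϵ = 0.1 governing exploration and rewards computed on V validation examples. The following is a minimal toy sketch of that challenger-arm idea, not the authors' released implementation: `case_sketch`, `score`, and the cheap deterministic `reward_fn` stand-in (replacing LLM validation accuracy) are all illustrative names and simplifications.

```python
import itertools
import random

def case_sketch(exemplars, k=2, m=3, n_challengers=2, epsilon=0.1,
                reward_fn=sum, rounds=200, seed=0):
    """Toy sketch of challenger-arm top-m subset selection (CASE-style).

    Arms are k-subsets of the exemplar pool. Each round a small challenger
    set is drawn -- uniformly at random with probability epsilon
    (exploration), otherwise the best-scoring arms outside the current
    top-m (exploitation) -- and the top-m estimate is re-ranked.
    """
    rng = random.Random(seed)
    arms = list(itertools.combinations(exemplars, k))
    rewards = {}  # arms scored so far; the paper scores via LLM accuracy on V validation examples

    def score(arm):
        if arm not in rewards:
            rewards[arm] = reward_fn(arm)  # cheap deterministic stand-in
        return rewards[arm]

    top = rng.sample(arms, m)            # arbitrary initial top-m estimate
    for arm in top:
        score(arm)
    for _ in range(rounds):
        if rng.random() < epsilon:       # explore: random challengers
            challengers = rng.sample(arms, n_challengers)
        else:                            # exploit: best scored non-top arms
            rest = [a for a in rewards if a not in top]
            challengers = (sorted(rest, key=score, reverse=True)[:n_challengers]
                           or rng.sample(arms, n_challengers))
        for arm in challengers:
            score(arm)
        top = sorted(set(top) | set(challengers), key=score, reverse=True)[:m]
    return top

# Toy run: 6 "exemplars", reward of a subset = sum of its ids
best = case_sketch(range(6), k=2, m=3, reward_fn=sum)
```

The returned list is the estimated top-m arms in descending reward order; the paper's actual algorithm additionally uses bandit-style confidence bounds rather than raw observed rewards, which this sketch omits.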