Sample Efficient Demonstration Selection for In-Context Learning
Authors: Kiran Purohit, Venktesh V, Sourangshu Bhattacharya, Avishek Anand
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4. Experiments and Results We aim to address the following research questions: RQ I. How sample efficient is CASE compared to state-of-the-art exemplar selection and stochastic linear bandit methods? RQ II. Can CASE choose task-level demonstration samples without sacrificing task performance, compared to state-of-the-art exemplar selection methods? RQ III. Can CASE work without exploration? 4.1. Experimental Setup Synthetic Experiments For synthetic experiments, we adopt a setup similar to (Réda et al., 2021) and present results on simulated data. ... Datasets and Metrics: We evaluate on well-known datasets that require numerical and commonsense reasoning. ... We report performance using the official metrics: Exact Match (EM) and Cover-EM ... Baselines: We compare against instance-level exemplar selection methods... |
| Researcher Affiliation | Academia | *Equal contribution. 1Indian Institute of Technology, Kharagpur, India 2Delft University of Technology (TU Delft), Netherlands. |
| Pseudocode | Yes | Algorithm 1 describes the proposed challenger-arm sampling based exploration technique, called CASE. ... Algorithm 1 CASE 1: Input: X: set of all training exemplars, k: prompt size, S: all k-subsets of X, a ∈ S: an arm or k-subset 2: Define: Ut: set of currently estimated top-m arms. ... 26: Output: UT: Set of m arms which have the highest reward |
| Open Source Code | Yes | We release our code and data: https://github.com/kiranpurohit/CASE |
| Open Datasets | Yes | Datasets and Metrics: We evaluate on well-known datasets that require numerical and commonsense reasoning. For numerical reasoning, we use GSM8K, FinQA, TabMWP, and AQuA-RAT; for commonsense reasoning, we use StrategyQA. Detailed descriptions of the datasets are provided in Appendix A.3. ... Table 4: Overview of the Complex QA datasets used in this study. |
| Dataset Splits | Yes | Detailed descriptions of the datasets are provided in Appendix A.3. ... FinQA: This dataset has 1147 questions in the evaluation set. ... AQuA-RAT: 100,000 algebraic word problems in the train set, with dev and test sets each comprising 254 problems. ... StrategyQA: we perform stratified sampling on the full train set of 2,290 examples to split it into 1,800 train and 490 test examples. |
| Hardware Specification | No | The paper mentions LLMs like Mistral-7b, Llama2-7b, gpt-3.5-turbo, and gpt-4o-mini as models being used or evaluated, but it does not specify the underlying hardware (e.g., GPU models, CPU types, or memory) on which these models were run for the experiments. |
| Software Dependencies | No | The paper mentions using 'Sentence BERT' and large language models (LLMs) like GPT-3.5-turbo, GPT-4o-mini, Mistral-7b, and Llama2-7b, but it does not provide specific version numbers for any software, libraries, or frameworks used in the implementation. |
| Experiment Setup | Yes | Hyperparameters (LLMs): For LLMs, we set the temperature to 0.3 to reduce randomness. To reduce repetition, we apply a presence penalty of 0.6 and a frequency penalty of 0.8. The max_length for generation is set to 900. Hyperparameters (CASE): ... We set m = |Ut| = 10 to identify the top scoring subsets (arms) and choose a challenger set Nt of size 5. We set the number of validation examples V to 20 and ϵ = 0.1. |
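The Pseudocode and Experiment Setup rows above describe Algorithm 1 (CASE): arms are k-subsets of the exemplar pool, a top-m set is maintained, and a small challenger set of arms is sampled each round, with ϵ = 0.1 governing exploration and rewards computed on V validation examples. The following is a minimal toy sketch of that challenger-arm idea, not the authors' released implementation: `case_sketch`, `score`, and the cheap deterministic `reward_fn` stand-in (replacing LLM validation accuracy) are all illustrative names and simplifications.

```python
import itertools
import random

def case_sketch(exemplars, k=2, m=3, n_challengers=2, epsilon=0.1,
                reward_fn=sum, rounds=200, seed=0):
    """Toy sketch of challenger-arm top-m subset selection (CASE-style).

    Arms are k-subsets of the exemplar pool. Each round a small challenger
    set is drawn -- uniformly at random with probability epsilon
    (exploration), otherwise the best-scoring arms outside the current
    top-m (exploitation) -- and the top-m estimate is re-ranked.
    """
    rng = random.Random(seed)
    arms = list(itertools.combinations(exemplars, k))
    rewards = {}  # arms scored so far; the paper scores via LLM accuracy on V validation examples

    def score(arm):
        if arm not in rewards:
            rewards[arm] = reward_fn(arm)  # cheap deterministic stand-in
        return rewards[arm]

    top = rng.sample(arms, m)            # arbitrary initial top-m estimate
    for arm in top:
        score(arm)
    for _ in range(rounds):
        if rng.random() < epsilon:       # explore: random challengers
            challengers = rng.sample(arms, n_challengers)
        else:                            # exploit: best scored non-top arms
            rest = [a for a in rewards if a not in top]
            challengers = (sorted(rest, key=score, reverse=True)[:n_challengers]
                           or rng.sample(arms, n_challengers))
        for arm in challengers:
            score(arm)
        top = sorted(set(top) | set(challengers), key=score, reverse=True)[:m]
    return top

# Toy run: 6 "exemplars", reward of a subset = sum of its ids
best = case_sketch(range(6), k=2, m=3, reward_fn=sum)
```

The returned list is the estimated top-m arms in descending reward order; the paper's actual algorithm additionally uses bandit-style confidence bounds rather than raw observed rewards, which this sketch omits.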