ELICIT: LLM Augmentation Via External In-context Capability
Authors: Futing Wang, Jianhao (Elliott) Yan, Yue Zhang, Tao Lin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our comprehensive experiments and analysis demonstrate that our pipeline is highly transferable across different input formats, tasks, and model architectures. ELICIT serves as a plug-and-play performance booster to enable adaptive elicitation of model capabilities. By externally storing and reusing vectors that represent in-context learned capabilities, ELICIT not only demonstrates the potential to operate modular capabilities but also significantly enhances the performance, versatility, adaptability, and scalability of large language models. |
| Researcher Affiliation | Academia | Futing Wang (1,2), Jianhao Yan (1,2), Yue Zhang (2,3), Tao Lin (2,4). (1) Zhejiang University; (2) Westlake University; (3) Institute of Advanced Technology, Westlake Institute for Advanced Study; (4) Research Center for Industries of the Future, Westlake University |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. It defines formal concepts but does not present step-by-step procedures in a pseudocode format. |
| Open Source Code | Yes | Our code is publicly available: https://github.com/LINs-lab/ELICIT |
| Open Datasets | Yes | Knowledge: CommonsenseQA (Talmor et al., 2018), OpenBookQA (Mihaylov et al., 2018), HellaSwag (Zellers et al., 2019), and BoolQ (Clark et al., 2019); Reasoning: four subsets from Big-Bench Hard (BBH) (Suzgun et al., 2022) and ARC-Challenge (Clark et al., 2018); Mathematics: MathQA (Amini et al., 2019) and MMLU-Pro-MATH (Wang et al., 2024); Safety: CrowS-Pairs (Nangia et al., 2020), BBQ-Age (Parrish et al., 2021), Ethics-Commonsense, and Ethics-Justice (Merity et al., 2016); Natural Language Understanding (NLU): GLUE (SST-2, QNLI, MNLI) (Wang, 2018) and SuperGLUE (WiC, RTE) (Wang et al., 2019). |
| Dataset Splits | Yes | Our primary objective was to maintain robust evaluation capabilities while ensuring sufficient training data for ICL prompt construction. For datasets with pre-existing splits (ARC-Challenge, Ethics, GLUE, MathQA, OpenBookQA), we preserved the original partitioning. When handling datasets with only train-valid splits, we employed two approaches: for those with validation sets exceeding 350 samples (e.g., BoolQ, HellaSwag), we split the validation set into new validation and test sets at a 7:3 ratio; for those with smaller validation sets (e.g., CommonsenseQA), we divided the training set into new train and test sets (7:3). For test-only datasets, we implemented different strategies based on size: smaller datasets like BBH (250 samples) were split to ensure 128 samples for training and 80-100 samples for testing, with remaining samples allocated to validation. Larger test-only datasets (>1000 samples) such as MMLU-Pro-MATH, BBQ, and CrowS-Pairs were split into train-valid-test sets at a 7:2:1 ratio. The same 7:2:1 split was applied to train-only datasets like SuperGLUE and DeepMind. |
| Hardware Specification | No | The paper mentions the models used (Pythia-2.8B, LLaMA3-8B, Mistral-7B, Mamba-2.8B) and that experiments were supported by Westlake University Center for High-performance Computing. However, it does not provide specific hardware details such as exact GPU models, CPU types, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions using "huggingface implementations (Wolf et al., 2020)" and a classifier built upon the SimCSE RoBERTa model (princeton-nlp/sup-simcse-roberta-base). However, it does not specify version numbers for general software dependencies such as Python, PyTorch, or CUDA, which are necessary for full reproducibility. |
| Experiment Setup | Yes | Our library contains k·\|T\| items for each model, each consisting of three key components: (1) the ICL prompt p^t_i, (2) the corresponding task vector θ ∈ R^(L×d), and (3) the pre-identified optimal layer l*. Here, k denotes the number of ICL prompts for each task t, and we use k = 10 for illustration. [...] We obtain the task vector θ ∈ R^(L×d) by processing the ICL prompt p^t_i (defined in (1), using randomly selected N = 16 demonstrations). [...] Figure 3 provides a detailed visualization of how varying α affects both accuracy and cross-entropy loss in the Llama3-8B model across a diverse set of 20 tasks. Results reveal a clear trade-off between task performance and language-modeling capability as intervention strength increases. Among the strategies tested, the additive approach h_l ← h_l + 2·θ_l consistently demonstrates superior performance across a wide range of tasks while minimizing degradation in language-modeling ability. [...] We fine-tuned this model over 15 epochs using a learning rate of 2e-5 on our curated dataset of 10,000 examples. The trained classifier is then used to compute similarity scores between a given query and each ICL prompt in our library. These scores are used to rank all library items, producing a similarity list of size k·\|T\|. The top-ranked task vector from this list is selected as our target for further processing. [...] Our evaluation of various recall levels, as shown in Figure 4b, reveals that a recall of 0.8 provides the optimal balance for our pipeline; results for other models are shown in Appendix D. [...] We implement Dynamic Top-K Thresholding (DTT). If the highest similarity score exceeds the threshold, we select the top 10 task vectors from the ranked list for further processing. We then employ a majority-voting mechanism among the optimal layers suggested by these top vectors to determine the final layer for intervention. |
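
The 7:2:1 train/valid/test split described in the Dataset Splits row can be sketched as below. The function name, seeding, and shuffle strategy are illustrative assumptions, not taken from the released code.

```python
import random

def split_7_2_1(examples, seed=0):
    """Shuffle and split a dataset 7:2:1 into train/valid/test, as the
    report describes for larger test-only datasets (e.g. MMLU-Pro-MATH,
    BBQ, CrowS-Pairs). Name and seeding are illustrative, not ELICIT's."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    n = len(examples)
    n_train, n_valid = int(0.7 * n), int(0.2 * n)
    train = [examples[i] for i in idx[:n_train]]
    valid = [examples[i] for i in idx[n_train:n_train + n_valid]]
    test = [examples[i] for i in idx[n_train + n_valid:]]
    return train, valid, test

# Toy usage on 1000 dummy examples: 700 / 200 / 100.
train, valid, test = split_7_2_1(list(range(1000)))
```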
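
The additive intervention quoted in the Experiment Setup row (h_l ← h_l + α·θ_l at the pre-identified optimal layer, with α = 2 performing best) can be sketched as follows. This is a minimal NumPy sketch of the update rule only; the argument names and the (L, d) array layout are assumptions, and the real pipeline would apply this inside the model's forward pass (e.g. via a forward hook) rather than on detached arrays.

```python
import numpy as np

def apply_task_vector(hidden_states, task_vector, layer, alpha=2.0):
    """Additive intervention h_l <- h_l + alpha * theta_l at one layer.

    hidden_states: array of shape (L, d), per-layer residual states
    task_vector:   array of shape (L, d), theta in R^(L x d)
    layer:         the pre-identified optimal layer l*
    alpha:         intervention strength (the paper's sweep favors 2)
    """
    out = hidden_states.copy()  # leave the original states untouched
    out[layer] = out[layer] + alpha * task_vector[layer]
    return out

# Toy usage: only layer 2 is shifted, all other layers stay unchanged.
L, d = 4, 8
h = np.zeros((L, d))
theta = np.ones((L, d))
h_new = apply_task_vector(h, theta, layer=2)
```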
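
The Dynamic Top-K Thresholding (DTT) step quoted above (accept only if the best similarity score exceeds the threshold, then majority-vote over the optimal layers of the top-k task vectors) can be sketched like this. The function signature is hypothetical; only the threshold/top-10/majority-vote logic comes from the paper's description.

```python
from collections import Counter

def select_intervention_layer(scores, layers, threshold, k=10):
    """Dynamic Top-K Thresholding sketch: if the highest similarity score
    exceeds the threshold, take the top-k library items and majority-vote
    over their pre-identified optimal layers.

    scores: one similarity score per library item (higher = more similar)
    layers: the pre-identified optimal layer of each library item
    Returns the chosen layer, or None when no stored capability matches.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if not ranked or scores[ranked[0]] <= threshold:
        return None  # query matches nothing; leave the model unmodified
    votes = Counter(layers[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy usage: three items clear the threshold, two agree on layer 12.
print(select_intervention_layer([0.9, 0.85, 0.2, 0.8], [12, 12, 3, 9],
                                threshold=0.5, k=3))  # -> 12
```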