ELICIT: LLM Augmentation Via External In-context Capability

Authors: Futing Wang, Jianhao (Elliott) Yan, Yue Zhang, Tao Lin

ICLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | Our comprehensive experiments and analysis demonstrate that our pipeline is highly transferable across different input formats, tasks, and model architectures. ELICIT serves as a plug-and-play performance booster to enable adaptive elicitation of model capabilities. By externally storing and reusing vectors that represent in-context learned capabilities, ELICIT not only demonstrates the potential to operate modular capabilities but also significantly enhances the performance, versatility, adaptability, and scalability of large language models.
Researcher Affiliation | Academia | Futing Wang (1,2), Jianhao Yan (1,2), Yue Zhang (2,3), Tao Lin (2,4). (1) Zhejiang University; (2) Westlake University; (3) Institute of Advanced Technology, Westlake Institute for Advanced Study; (4) Research Center for Industries of the Future, Westlake University.
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. It defines formal concepts but does not present step-by-step procedures in a pseudocode format.
Open Source Code | Yes | Our code is publicly available at https://github.com/LINs-lab/ELICIT.
Open Datasets | Yes | Knowledge: CommonsenseQA (Talmor et al., 2018), OpenBookQA (Mihaylov et al., 2018), HellaSwag (Zellers et al., 2019), and BoolQ (Clark et al., 2019); Reasoning: four subsets from Big-Bench Hard (BBH) (Suzgun et al., 2022) and ARC-Challenge (Clark et al., 2018); Mathematics: MathQA (Amini et al., 2019) and MMLU Pro-MATH (Wang et al., 2024); Safety: CrowS-Pairs (Nangia et al., 2020), BBQ-Age (Parrish et al., 2021), Ethics-Commonsense, and Ethics-Justice (Merity et al., 2016); Natural Language Understanding (NLU): GLUE (SST-2, QNLI, MNLI) (Wang, 2018) and SuperGLUE (WIC, RTE) (Wang et al., 2019).
Dataset Splits | Yes | Our primary objective was to maintain robust evaluation capabilities while ensuring sufficient training data for ICL prompt construction. For datasets with pre-existing splits (ARC-Challenge, Ethics, GLUE, MathQA, OpenBookQA), we preserved the original partitioning. When handling datasets with only train-valid splits, we employed two approaches: for those with validation sets exceeding 350 samples (e.g., BoolQ, HellaSwag), we split the validation set into new validation and test sets at a 7:3 ratio; for those with smaller validation sets (e.g., CommonsenseQA), we divided the training set into new train and test sets (7:3). For test-only datasets, we implemented different strategies based on size: smaller datasets like BBH (250 samples) were split to ensure 128 samples for training and 80-100 samples for testing, with remaining samples allocated to validation. Larger test-only datasets (>1000 samples) such as MMLU-Pro-Math, BBQ, and CrowS-Pairs were split into train-valid-test sets at a 7:2:1 ratio. The same 7:2:1 split was applied to train-only datasets like SuperGLUE and DeepMind.
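The ratio-based splitting described above (7:3 two-way and 7:2:1 three-way partitions) can be sketched in plain Python. This is an illustrative helper, not code from the ELICIT repository; the function name `split_indices` and the fixed seed are assumptions for the example:

```python
import random

def split_indices(n, ratios, seed=0):
    """Shuffle indices 0..n-1 and partition them by the given ratios.

    ratios: e.g. (0.7, 0.2, 0.1) for a train/valid/test split, or
    (0.7, 0.3) for the two-way splits described in the report.
    The last part absorbs any rounding remainder.
    """
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducibility
    parts, start = [], 0
    for r in ratios[:-1]:
        size = int(round(n * r))
        parts.append(idx[start:start + size])
        start += size
    parts.append(idx[start:])  # remainder goes to the final split
    return parts

# Example: a 7:2:1 split of a 1000-example test-only dataset.
train, valid, test = split_indices(1000, (0.7, 0.2, 0.1))
```

With 1000 examples this yields 700/200/100 disjoint index sets; per-dataset rules such as "128 train, 80-100 test for BBH" would be handled with explicit sizes rather than ratios.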
Hardware Specification | No | The paper mentions the models used (Pythia-2.8B, LLaMA3-8B, Mistral-7B, Mamba-2.8B) and that experiments were supported by the Westlake University Center for High-performance Computing. However, it does not provide specific hardware details such as exact GPU models, CPU types, or memory specifications used for the experiments.
Software Dependencies | No | The paper mentions using "huggingface implementations (Wolf et al., 2020)" and a classifier "built upon the SimCSE RoBERTa model" (with a footnote referencing "princeton-nlp/sup-simcse-roberta-base"). However, it does not specify version numbers for general software dependencies like Python, PyTorch, or CUDA, which are necessary for full reproducibility.
Experiment Setup | Yes | Our library contains k·|T| items for each model, each consisting of three key components: (1) the ICL prompt p_i^t, (2) the corresponding task vector θ ∈ R^(L×d), and (3) the pre-identified optimal layer l*. Here, k denotes the number of ICL prompts for each task t, and we use k = 10 for illustration. [...] We obtain the task vector θ ∈ R^(L×d) by processing the ICL prompt p_i^t (defined in (1), using randomly selected N = 16 demonstrations). [...] Figure 3 provides a detailed visualization of how varying α affects both accuracy and cross-entropy loss in the Llama3-8B model across a diverse set of 20 tasks. Results reveal a clear trade-off between task performance and language modeling capability as intervention strength increases. Among the strategies tested, the additive approach h_l ← h_l + 2θ_l consistently demonstrates superior performance across a wide range of tasks while minimizing degradation in language modeling ability. [...] We fine-tuned this model over 15 epochs using a learning rate of 2e-5 on our curated dataset of 10,000 examples. The trained classifier is then used to compute similarity scores between a given query and each ICL prompt in our library. These scores are used to rank all library items, producing a similarity list of size k·|T|. The top-ranked task vector from this list is selected as our target for further processing. [...] Our evaluation of various recall levels, as shown in Figure 4b, reveals that a recall of 0.8 provides the optimal balance for our pipeline; results for other models are shown in Appendix D. [...] We implement Dynamic Top-K Thresholding (DTT). If the highest similarity score exceeds the threshold, we select the top 10 task vectors from the ranked list for further processing. We then employ a majority voting mechanism among the optimal layers suggested by these top vectors to determine the final layer for intervention.
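The two mechanisms quoted above, the additive intervention h_l ← h_l + α·θ_l and Dynamic Top-K Thresholding with layer majority voting, can be sketched in pure Python. This is a minimal illustration under stated assumptions, not the released ELICIT implementation; the function names `additive_intervention` and `dynamic_top_k` are hypothetical:

```python
from collections import Counter

def additive_intervention(h, theta, layer, alpha=2.0):
    """Additive steering at a single layer: h[layer] <- h[layer] + alpha * theta[layer].

    h, theta: per-layer hidden states / task vector, as length-L lists
    of length-d lists (standing in for the R^(L x d) tensors).
    alpha = 2.0 mirrors the paper's reported additive strategy h_l + 2*theta_l.
    """
    out = [row[:] for row in h]  # copy so the original states are untouched
    out[layer] = [x + alpha * t for x, t in zip(h[layer], theta[layer])]
    return out

def dynamic_top_k(scores, layers, threshold, k=10):
    """Dynamic Top-K Thresholding (DTT), as described in the report.

    scores: similarity of the query to each of the k*|T| library prompts.
    layers: the pre-identified optimal layer l* stored with each item.
    If the best score clears the threshold, majority-vote the optimal
    layers of the top-k items; otherwise return None (no intervention).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if scores[order[0]] < threshold:
        return None
    votes = Counter(layers[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

For example, with similarity scores [0.9, 0.8, 0.7, 0.2], stored layers [5, 5, 3, 7], threshold 0.5, and k = 3, the vote among the top three items selects layer 5; the returned layer would then index the intervention in `additive_intervention`.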