Large Language Models are Demonstration Pre-Selectors for Themselves
Authors: Jiarui Jin, Yuwei Wu, Haoxuan Li, Xiaoting He, Weinan Zhang, Yiming Yang, Yong Yu, Jun Wang, Mengyue Yang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with LLMs ranging from 300M to 8B parameters show that FEEDER can reduce training data size by over 20% while maintaining performance and seamlessly integrating with various downstream demonstration selection strategies in ICL. Our empirical evaluations encompass six LLM bases, ranging from 335M to 7B parameters, and include six demonstration selectors in the demonstration selection stage, applied to text classification, reasoning, and semantic parsing tasks. |
| Researcher Affiliation | Collaboration | 1 Shanghai Jiao Tong University, 2 Xiaohongshu Inc., 3 Carnegie Mellon University, 4 Peking University, 5 No Affiliation, 6 University College London, 7 University of Bristol. |
| Pseudocode | Yes | Algorithm 1 (Bi-level Optimization). Input: training dataset D_TRAIN, LLM Ψ_LLM; Output: approximated subset D̃_FEEDER, tuned LLM Ψ_LLM. Algorithm 2 (Approximation Algorithm for FEEDER). Input: training dataset D_TRAIN; Output: an approximated FEEDER set D̃_FEEDER. Algorithm 3 (Exact Algorithm for FEEDER). Input: training dataset D_TRAIN; Output: an exact FEEDER set D̃_FEEDER. Algorithm 4 (Alternative Exact Algorithm for FEEDER). Input: training dataset D_TRAIN; Output: exact FEEDER set D̃_FEEDER. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | Our evaluations are mainly conducted on 6 text classification datasets: SST-2 (Socher et al., 2013), SST-5 (Socher et al., 2013), COLA (Warstadt et al., 2018), TREC (Voorhees & Tice, 2000), SUBJ (Pang & Lee, 2004), and FPB (Malo et al., 2014). We also further assess FEEDER on the reasoning dataset GSM8K (Cobbe et al., 2021), the semantic-parsing dataset SMCALFlow (Andreas et al., 2020), and the scientific question-answering dataset GPQA (Rein et al., 2024). |
| Dataset Splits | Yes | For each dataset, we directly follow the official splits to obtain DTRAIN and DTEST. We report both the mean and variance of accuracy using 8 different seeds and 5 different permutations of n-shots. |
| Hardware Specification | Yes | All our experiments are conducted with NVIDIA A100s. |
| Software Dependencies | No | The paper mentions "Sentence Transformers library2 from Hugging Face" but does not specify a version number for this library or any other software component used in the experiments. |
| Experiment Setup | Yes | The batch size is set to 32, the number of warmup steps to 100, the learning rate to 5 × 10−5, and the weight decay to 0.01. |
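The pseudocode row above describes FEEDER's approximation algorithm as pruning the training set down to a subset that preserves in-context-learning performance. A minimal sketch of that idea, under loud assumptions: the `covers` function below is a stand-in for the paper's LLM-based sufficiency check (whether the LLM, conditioned on the running subset, still handles a candidate example), replaced here with a trivial label-match test so the sketch is self-contained and runnable.

```python
# Hypothetical sketch in the spirit of FEEDER's approximation algorithm
# (Algorithm 2): greedily keep only training examples that the running
# subset does not already "cover". NOT the paper's implementation; the
# real check queries the LLM rather than comparing labels.

def covers(selected, example):
    # Stand-in for the LLM sufficiency check: here an example counts as
    # covered if any already-selected demonstration shares its label.
    return any(demo["label"] == example["label"] for demo in selected)

def preselect(train_set):
    """Return a pruned demonstration pool: skip examples already covered."""
    feeder = []
    for example in train_set:
        if not covers(feeder, example):
            feeder.append(example)
    return feeder

train = [
    {"text": "great movie", "label": "pos"},
    {"text": "loved it", "label": "pos"},
    {"text": "terrible", "label": "neg"},
]
subset = preselect(train)
# One representative per label survives; redundant examples are pruned.
```

The pruned `subset` would then feed any downstream demonstration selector, matching the paper's claim that FEEDER composes with existing selection strategies.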
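For reproduction, the reported optimizer settings map naturally onto Hugging Face `TrainingArguments`. This is a sketch only: the paper does not state which training framework it uses, so the field names and the output path below are assumptions, not the authors' configuration.

```python
from transformers import TrainingArguments

# Sketch: the four values come from the paper's experiment setup;
# everything else (framework choice, output_dir) is assumed.
args = TrainingArguments(
    output_dir="feeder-run",          # hypothetical output path
    per_device_train_batch_size=32,   # batch size 32
    warmup_steps=100,                 # 100 warmup steps
    learning_rate=5e-5,               # 5 × 10^-5
    weight_decay=0.01,
)
```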