Demonstration Selection for In-Context Learning via Reinforcement Learning
Authors: Xubin Wang, Jianfei Wu, Yuan Yichen, Deyu Cai, Mingzhe Li, Weijia Jia
ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on multiple benchmark datasets, including diverse reasoning tasks, and involving 14 closed-source and open-source LLMs, we demonstrate that RDES significantly enhances performance compared to ten established baselines. Our evaluation includes analysis of performance across varying numbers of demonstrations on selected datasets. Furthermore, we investigate incorporating Chain-of-Thought (CoT) reasoning, which further boosts predictive performance. |
| Researcher Affiliation | Academia | 1BNU-BNBU Institute of Artificial Intelligence and Future Networks, Beijing Normal-Hong Kong Baptist University, Zhuhai, China 2Hong Kong Baptist University, Hong Kong, China 3Beijing Normal University at Zhuhai, Zhuhai, China. Correspondence to: Weijia Jia <EMAIL>, Xubin Wang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 RDES Training Framework. Require: knowledge base K, test inputs D_test, LLM M, RL algorithm A. 1: Initialize selection policy π (Q-table or neural networks). 2: Precompute TF-IDF vectors for each sample x_i ∈ D_test and for knowledge base K. 3: for i = 1 to N do. 4: Sample input x_i ∈ D_test. 5: Select demonstrations E with initial candidates based on relevance (e.g., top-k TF-IDF matches from K) and apply diversity adjustment. 6: Generate prompt p = Format(x_i, E). 7: Obtain prediction ŷ = M(p). 8: Compute diversity score D = \|L(E)\| / k. 9: Encode state s = ϕ(x_i, E, ŷ, D). 10: Select action a ∼ π(s) (example index from K). 11: Calculate reward r = I(y_true = ŷ) + λ(D_new − D_old). 12: Update policy parameters θ using A with (s, a, r). 13: end for. 14: Return optimized policy π. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its source code or a link to a code repository. |
| Open Datasets | Yes | In our framework evaluation, we utilize four widely recognized datasets that encompass a diverse range of domains and intents. The BANKING77 dataset (Casanueva et al., 2020) provides a comprehensive set of intents specifically relevant to the banking sector. Additionally, the HWU64 (Liu et al., 2021) and LIU54 (Liu et al., 2021) datasets offer extensive multi-domain coverage, making them particularly valuable for comparative analysis. We also include the CLINC150 dataset (Larson et al., 2019), which further enriches our evaluation framework... To further assess the generalizability and effectiveness of RDES on tasks requiring more complex reasoning, we also conducted additional experiments on challenging benchmarks. These include subsets of BIG-Bench Hard (Suzgun et al., 2023) (specifically, boolean expressions and web of lies), GSM-8K (Cobbe et al., 2021), and SST5 (Socher et al., 2013). |
| Dataset Splits | No | To better align our evaluation with real-world application scenarios, we employed a challenge set sampling strategy, drawing on the principles outlined in (Lu et al., 2024). This approach allowed us to select a demanding subset from the original test splits based on the precision margin, ensuring a rigorous assessment of our model's performance. ... Due to time constraints, we randomly sampled 1,000 examples from the test sets for evaluation. This only discusses test set sampling, not the full train/validation/test splits. |
| Hardware Specification | No | The paper refers to using various LLMs (e.g., GPT-3.5-turbo, Qwen-2.5-72B, DeepSeek-R1-32B) but does not provide specific details about the hardware (GPU/CPU models, memory) on which their experiments were conducted. |
| Software Dependencies | No | The paper does not provide specific software dependencies (e.g., library names with version numbers) used for the implementation. |
| Experiment Setup | No | The paper describes the general methodology including RL formulation, Q-learning approach, and PPO variant, and mentions an annealing schedule for lambda. However, it does not provide specific hyperparameter values (e.g., learning rates, batch sizes, number of epochs, optimizer settings) needed to reproduce the experimental setup. |
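The training loop in Algorithm 1 can be sketched in a few dozen lines. The following is a minimal toy illustration, not the authors' implementation: it uses plain term-frequency vectors in place of TF-IDF, a tabular Q update in place of the paper's Q-learning/PPO variants, and a nearest-demonstration lookup standing in for the LLM call M(p). The knowledge base, test inputs, and hyperparameter values (alpha, lam, k) are all invented for illustration; only the shape of the loop — relevance retrieval, diversity score D = |L(E)|/k, reward r = I(y_true = ŷ) + λ(D_new − D_old), policy update — follows the pseudocode.

```python
import math, random
from collections import Counter, defaultdict

random.seed(0)

# Toy knowledge base K of (text, label) demonstrations and test inputs D_test.
# All examples here are invented; the paper uses intent-classification datasets.
K = [
    ("transfer money to savings", "transfer"),
    ("send funds abroad", "transfer"),
    ("card was declined at store", "card_issue"),
    ("my card is blocked", "card_issue"),
    ("what is my account balance", "balance"),
    ("show balance on checking", "balance"),
]
TESTS = [
    ("move money between accounts", "transfer"),
    ("balance of my account", "balance"),
]

def tf_vec(text):
    """Term-frequency vector (simplified stand-in for the paper's TF-IDF)."""
    return Counter(text.lower().split())

def cosine(u, v):
    num = sum(u[t] * v[t] for t in u)
    den = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return num / den if den else 0.0

def top_k(x, k):
    """Relevance step (Alg. 1, line 5): top-k TF matches from K for input x."""
    xv = tf_vec(x)
    return sorted(K, key=lambda d: cosine(xv, tf_vec(d[0])), reverse=True)[:k]

def diversity(E, k):
    """D = |L(E)| / k: fraction of distinct labels among the k demonstrations."""
    return len({lbl for _, lbl in E}) / k

def predict(x, E):
    """Stand-in for the LLM call M(p): label of the most similar demonstration."""
    xv = tf_vec(x)
    return max(E, key=lambda d: cosine(xv, tf_vec(d[0])))[1]

Q = defaultdict(float)        # Q-table policy pi, keyed by (state, action)
alpha, lam, k = 0.5, 0.3, 3   # illustrative hyperparameters (not from the paper)

for x, y_true in TESTS * 5:                      # N sampled inputs (lines 3-4)
    E = top_k(x, k)
    d_old = diversity(E, k)
    s = tuple(sorted(lbl for _, lbl in E))       # state: label profile of E
    a = random.randrange(len(K))                 # random action; eps-greedy would go here
    E_new = E[:-1] + [K[a]]                      # diversity adjustment: swap one demo
    d_new = diversity(E_new, k)
    y_hat = predict(x, E_new)
    r = float(y_hat == y_true) + lam * (d_new - d_old)   # reward (line 11)
    Q[(s, a)] += alpha * (r - Q[(s, a)])         # one-step tabular update (line 12)
```

Because each Q update moves the value toward an observed reward, every entry stays inside the reward range [−λ, 1 + λ]; the paper additionally anneals λ over training, which this sketch omits.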