Demonstration Selection for In-Context Learning via Reinforcement Learning
Authors: Xubin Wang, Jianfei Wu, Yuan Yichen, Deyu Cai, Mingzhe Li, Weijia Jia
ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on multiple benchmark datasets, including diverse reasoning tasks, and involving 14 closed-source and open-source LLMs, we demonstrate that RDES significantly enhances performance compared to ten established baselines. Our evaluation includes analysis of performance across varying numbers of demonstrations on selected datasets. Furthermore, we investigate incorporating Chain-of-Thought (CoT) reasoning, which further boosts predictive performance. |
| Researcher Affiliation | Academia | 1BNU-BNBU Institute of Artificial Intelligence and Future Networks, Beijing Normal-Hong Kong Baptist University, Zhuhai, China 2Hong Kong Baptist University, Hong Kong, China 3Beijing Normal University at Zhuhai, Zhuhai, China. Correspondence to: Weijia Jia <EMAIL>, Xubin Wang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 RDES Training Framework. Require: knowledge base K, test inputs D_test, LLM M, RL algorithm A. 1: Initialize selection policy π (Q-table or neural networks). 2: Precompute TF-IDF vectors for each sample x_i ∈ D_test and for knowledge base K. 3: for i = 1 to N do. 4: Sample input x_i ∈ D_test. 5: Select demonstrations E with initial candidates based on relevance (e.g., top-k TF-IDF matches from K) and apply diversity adjustment. 6: Generate prompt p = Format(x_i, E). 7: Obtain prediction ŷ = M(p). 8: Compute diversity score D = \|L(E)\| / k. 9: Encode state s = ϕ(x_i, E, ŷ, D). 10: Select action a ∼ π(s) (example index from K). 11: Calculate reward r = I(y_true = ŷ) + λ(D_new − D_old). 12: Update policy parameters θ using A with (s, a, r). 13: end for. 14: Return optimized policy π. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its source code or a link to a code repository. |
| Open Datasets | Yes | In our framework evaluation, we utilize four widely recognized datasets that encompass a diverse range of domains and intents. The BANKING77 dataset (Casanueva et al., 2020) provides a comprehensive set of intents specifically relevant to the banking sector. Additionally, the HWU64 (Liu et al., 2021) and LIU54 (Liu et al., 2021) datasets offer extensive multi-domain coverage, making them particularly valuable for comparative analysis. We also include the CLINC150 dataset (Larson et al., 2019), which further enriches our evaluation framework... To further assess the generalizability and effectiveness of RDES on tasks requiring more complex reasoning, we also conducted additional experiments on challenging benchmarks. These include subsets of BIG-Bench Hard (Suzgun et al., 2023) (specifically, boolean expressions and web of lies), GSM-8K (Cobbe et al., 2021), and SST5 (Socher et al., 2013). |
| Dataset Splits | No | To better align our evaluation with real-world application scenarios, we employed a challenge set sampling strategy, drawing on the principles outlined in (Lu et al., 2024). This approach allowed us to select a demanding subset from the original test splits based on the precision margin, ensuring a rigorous assessment of our model's performance. ... Due to time constraints, we randomly sampled 1,000 examples from the test sets for evaluation. This only discusses test set sampling, not the full train/validation/test splits. |
| Hardware Specification | No | The paper refers to using various LLMs (e.g., GPT-3.5-turbo, Qwen-2.5-72B, DeepSeek-R1-32B) but does not provide specific details about the hardware (GPU/CPU models, memory) on which their experiments were conducted. |
| Software Dependencies | No | The paper does not provide specific software dependencies (e.g., library names with version numbers) used for the implementation. |
| Experiment Setup | No | The paper describes the general methodology including RL formulation, Q-learning approach, and PPO variant, and mentions an annealing schedule for lambda. However, it does not provide specific hyperparameter values (e.g., learning rates, batch sizes, number of epochs, optimizer settings) needed to reproduce the experimental setup. |
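The training loop in Algorithm 1 can be sketched in a few dozen lines. The following is a minimal toy illustration, not the authors' implementation: it uses plain term-frequency vectors in place of TF-IDF, a tabular Q update in place of the paper's Q-learning/PPO variants, and a nearest-demonstration lookup standing in for the LLM call M(p). The knowledge base, test inputs, and hyperparameter values (alpha, lam, k) are all invented for illustration; only the shape of the loop — relevance retrieval, diversity score D = |L(E)|/k, reward r = I(y_true = ŷ) + λ(D_new − D_old), policy update — follows the pseudocode.

```python
import math, random
from collections import Counter, defaultdict

random.seed(0)

# Toy knowledge base K of (text, label) demonstrations and test inputs D_test.
# All examples here are invented; the paper uses intent-classification datasets.
K = [
    ("transfer money to savings", "transfer"),
    ("send funds abroad", "transfer"),
    ("card was declined at store", "card_issue"),
    ("my card is blocked", "card_issue"),
    ("what is my account balance", "balance"),
    ("show balance on checking", "balance"),
]
TESTS = [
    ("move money between accounts", "transfer"),
    ("balance of my account", "balance"),
]

def tf_vec(text):
    """Term-frequency vector (simplified stand-in for the paper's TF-IDF)."""
    return Counter(text.lower().split())

def cosine(u, v):
    num = sum(u[t] * v[t] for t in u)
    den = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return num / den if den else 0.0

def top_k(x, k):
    """Relevance step (Alg. 1, line 5): top-k TF matches from K for input x."""
    xv = tf_vec(x)
    return sorted(K, key=lambda d: cosine(xv, tf_vec(d[0])), reverse=True)[:k]

def diversity(E, k):
    """D = |L(E)| / k: fraction of distinct labels among the k demonstrations."""
    return len({lbl for _, lbl in E}) / k

def predict(x, E):
    """Stand-in for the LLM call M(p): label of the most similar demonstration."""
    xv = tf_vec(x)
    return max(E, key=lambda d: cosine(xv, tf_vec(d[0])))[1]

Q = defaultdict(float)        # Q-table policy pi, keyed by (state, action)
alpha, lam, k = 0.5, 0.3, 3   # illustrative hyperparameters (not from the paper)

for x, y_true in TESTS * 5:                      # N sampled inputs (lines 3-4)
    E = top_k(x, k)
    d_old = diversity(E, k)
    s = tuple(sorted(lbl for _, lbl in E))       # state: label profile of E
    a = random.randrange(len(K))                 # random action; eps-greedy would go here
    E_new = E[:-1] + [K[a]]                      # diversity adjustment: swap one demo
    d_new = diversity(E_new, k)
    y_hat = predict(x, E_new)
    r = float(y_hat == y_true) + lam * (d_new - d_old)   # reward (line 11)
    Q[(s, a)] += alpha * (r - Q[(s, a)])         # one-step tabular update (line 12)
```

Because each Q update moves the value toward an observed reward, every entry stays inside the reward range [−λ, 1 + λ]; the paper additionally anneals λ over training, which this sketch omits.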