EVOLvE: Evaluating and Optimizing LLMs For In-Context Exploration
Authors: Allen Nie, Yi Su, Bo Chang, Jonathan Lee, Ed H. Chi, Quoc V. Le, Minmin Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications. We develop a comprehensive suite of environments, including both context-free and contextual bandits with varying task difficulties, to benchmark LLMs' performance. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs... We conducted an extensive ablation study to shed light on various factors, such as task difficulty and data representation, that influence the efficiency of LLM exploration. |
| Researcher Affiliation | Collaboration | Allen Nie * 1 Yi Su * 2 Bo Chang * 2 Jonathan N. Lee 2 Ed H. Chi 2 Quoc V. Le 2 Minmin Chen 2 *Equal contribution 1Stanford University 2Google DeepMind. Correspondence to: Allen Nie <EMAIL>, Yi Su <EMAIL>. |
| Pseudocode | No | The paper describes the UCB and LinUCB algorithms and their mathematical formulations in Section 5, but it does not present them in structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | BanditBench and the inference code have been provided in this GitHub repo and will be updated/monitored regularly: https://github.com/allenanie/EVOLvE. You can install the code with: pip install banditbench. |
| Open Datasets | Yes | We use the MovieLens-1M dataset (Harper & Konstan, 2015) to build the contextual bandit task. |
| Dataset Splits | No | For CB, we use a fixed dataset and evaluate the LLM's performance on a held-out set of users. While these users are unseen during training, their profiles and preferences remain within the distribution of the training data. The paper mentions a 'held-out set of users' for CB tasks but does not specify explicit percentages or counts for training, validation, or test splits for any dataset used. |
| Hardware Specification | No | The paper mentions various models like Gemma-2B, Gemma-9B, Gemini 1.5 Flash, Gemini 1.5 Pro, GPT-4o, and Claude-3.5-sonnet, and discusses evaluation costs, but it does not specify the underlying hardware (e.g., GPU, CPU models) used for training or inference of these models or the experiments. |
| Software Dependencies | No | The paper mentions 'pip install banditbench' for installing their code and references 'Scikit-learn (Pedregosa et al., 2011)' for fitting functions. However, it does not provide specific version numbers for Python, other libraries, or software dependencies required to reproduce the experiments. |
| Experiment Setup | Yes | For MAB tasks, the interaction horizon (T) differs based on the size of the action space (K): we use T = 1000 for K = 30 and T = 200 for K = 10. All CB tasks use a constant horizon of 200 steps... We set the random seed to be the same as trial id, starting from 0 to 29. For the LLM calls, we use standard API calls and set the sampling temperature to 1.0 (range=[0.0, 2.0]). The default API (2024-08 to 2024-09) uses Top-P=0.95 sampling, and Top-K=40. |
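The UCB baseline that the paper references (but does not give as pseudocode) and the trial protocol quoted in the last row can be illustrated together. The following is a minimal sketch, not the authors' code: it runs UCB1 on synthetic Bernoulli arms with K = 10 and T = 200, using seeds 0..29 equal to the trial id as the table describes. The arm means and the exploration constant `c` are placeholder assumptions, not values from the paper.

```python
import math
import random

def ucb1(arm_means, horizon, c=2.0, seed=0):
    """Illustrative UCB1 on a Bernoulli bandit; returns cumulative regret."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k      # number of pulls per arm
    means = [0.0] * k     # empirical mean reward per arm
    best = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:        # initialization: pull each arm once
            arm = t - 1
        else:             # pick the arm maximizing mean + exploration bonus
            arm = max(range(k),
                      key=lambda a: means[a] + math.sqrt(c * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
        regret += best - arm_means[arm]
    return regret

# Protocol mirroring the table: K = 10 arms, T = 200, 30 trials, seed == trial id.
arms = [0.5] * 9 + [0.7]  # placeholder arm means, not from the paper
regrets = [ucb1(arms, horizon=200, seed=trial_id) for trial_id in range(30)]
print(sum(regrets) / len(regrets))
```

Averaging regret over 30 fixed seeds matches the paper's trial-id seeding convention, which makes runs directly comparable across models.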