EVOLvE: Evaluating and Optimizing LLMs For In-Context Exploration
Authors: Allen Nie, Yi Su, Bo Chang, Jonathan Lee, Ed H. Chi, Quoc V. Le, Minmin Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications. We develop a comprehensive suite of environments, including both context-free and contextual bandits with varying task difficulties, to benchmark LLMs' performance. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs... We conducted an extensive ablation study to shed light on various factors, such as task difficulty and data representation, that influence the efficiency of LLM exploration. |
| Researcher Affiliation | Collaboration | Allen Nie * 1 Yi Su * 2 Bo Chang * 2 Jonathan N. Lee 2 Ed H. Chi 2 Quoc V. Le 2 Minmin Chen 2 *Equal contribution 1Stanford University 2Google DeepMind. Correspondence to: Allen Nie <EMAIL>, Yi Su <EMAIL>. |
| Pseudocode | No | The paper describes the UCB and LinUCB algorithms and their mathematical formulations in Section 5, but it does not present them in structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | BanditBench and the inference code have been provided in this GitHub repo and will be updated/monitored regularly: https://github.com/allenanie/EVOLvE. You can install the code with: pip install banditbench. |
| Open Datasets | Yes | We use the MovieLens-1M dataset (Harper & Konstan, 2015) to build the contextual bandit task. |
| Dataset Splits | No | For CB, we use a fixed dataset and evaluate the LLM's performance on a held-out set of users. While these users are unseen during training, their profiles and preferences remain within the distribution of the training data. The paper mentions a 'held-out set of users' for CB tasks but does not specify explicit percentages or counts for training, validation, or test splits for any dataset used. |
| Hardware Specification | No | The paper mentions various models like Gemma-2B, Gemma-9B, Gemini 1.5 Flash, Gemini 1.5 Pro, GPT-4o, and Claude-3.5-sonnet, and discusses evaluation costs, but it does not specify the underlying hardware (e.g., GPU, CPU models) used for training or inference of these models or the experiments. |
| Software Dependencies | No | The paper mentions 'pip install banditbench' for installing their code and references 'Scikit-learn (Pedregosa et al., 2011)' for fitting functions. However, it does not provide specific version numbers for Python, other libraries, or software dependencies required to reproduce the experiments. |
| Experiment Setup | Yes | For MAB tasks, the interaction horizon (T) differs based on the size of the action space (K): we use T = 1000 for K = 30 and T = 200 for K = 10. All CB tasks use a constant horizon of 200 steps... We set the random seed to be the same as trial id, starting from 0 to 29. For the LLM calls, we use standard API calls and set the sampling temperature to 1.0 (range=[0.0, 2.0]). The default API (2024-08 to 2024-09) uses Top-P=0.95 sampling, and Top-K=40. |
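The UCB baseline that the paper references (but does not give as pseudocode) and the trial protocol quoted in the last row can be illustrated together. The following is a minimal sketch, not the authors' code: it runs UCB1 on synthetic Bernoulli arms with K = 10 and T = 200, using seeds 0..29 equal to the trial id as the table describes. The arm means and the exploration constant `c` are placeholder assumptions, not values from the paper.

```python
import math
import random

def ucb1(arm_means, horizon, c=2.0, seed=0):
    """Illustrative UCB1 on a Bernoulli bandit; returns cumulative regret."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k      # number of pulls per arm
    means = [0.0] * k     # empirical mean reward per arm
    best = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:        # initialization: pull each arm once
            arm = t - 1
        else:             # pick the arm maximizing mean + exploration bonus
            arm = max(range(k),
                      key=lambda a: means[a] + math.sqrt(c * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
        regret += best - arm_means[arm]
    return regret

# Protocol mirroring the table: K = 10 arms, T = 200, 30 trials, seed == trial id.
arms = [0.5] * 9 + [0.7]  # placeholder arm means, not from the paper
regrets = [ucb1(arms, horizon=200, seed=trial_id) for trial_id in range(30)]
print(sum(regrets) / len(regrets))
```

Averaging regret over 30 fixed seeds matches the paper's trial-id seeding convention, which makes runs directly comparable across models.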