Exploring exploration with foundation agents in interactive environments
Authors: Daniel P. Sawyer, Nan Rosemary Ke, Hubert Soyer, Martin Engelcke, John Reid, David P Reichert, Drew A. Hudson, Alexander Lerchner, Danilo Jimenez Rezende, Timothy P Lillicrap, Michael Curtis Mozer, Jane X Wang
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first evaluate foundation models in Feature World, a setting that primarily tests information gathering about a static hidden reward function. In this initial setting, we show that state-of-the-art foundation models come close to optimal efficiency... We performed experiments using Gemini 1.5 Pro and Flash... Our findings reveal a strong inherent exploratory capacity in foundation models across simple interactive settings. |
| Researcher Affiliation | Industry | Daniel P. Sawyer EMAIL Google DeepMind Nan Rosemary Ke EMAIL Google DeepMind Hubert Soyer EMAIL Google DeepMind Martin Engelcke EMAIL Google DeepMind John Reid EMAIL Google DeepMind David P Reichert EMAIL Google DeepMind Drew A. Hudson EMAIL Google DeepMind Alexander Lerchner EMAIL Google DeepMind Danilo Jimenez Rezende EMAIL Google DeepMind Timothy P Lillicrap EMAIL Google DeepMind Michael Mozer EMAIL Google DeepMind Jane X Wang EMAIL Google DeepMind |
| Pseudocode | No | The paper describes methodologies and experimental procedures in natural language, but does not contain any structured pseudocode or algorithm blocks. Examples of this include descriptions of the optimal strategy for Feature World in Section A.2.1 and the summarization strategy in Section A.4.2. |
| Open Source Code | No | We performed experiments using Gemini 1.5 Pro and Flash (Reid et al., 2024), Gemini 2.5 Pro and Flash (Google, 2025), Claude 3.7 Sonnet (Anthropic, 2025), and ChatGPT-4o (OpenAI, 2024) and o4-mini (OpenAI, 2025). We use the default settings for the public APIs in all cases unless noted otherwise. There is no explicit statement or link indicating that the authors' implementation code for their methodology is open-sourced. |
| Open Datasets | Yes | We evaluate LLMs in three environments: text-based and multimodal variants of Feature World, and a text-based version of Alchemy (Wang et al., 2021). Alchemy (Wang et al., 2021) is a procedurally generated environment specifically created to test meta-learning capabilities. |
| Dataset Splits | No | We perform experiments using Gemini 1.5 Pro and Flash (Reid et al., 2024), Gemini 2.5 Pro and Flash (Google, 2025), Claude 3.7 Sonnet (Anthropic, 2025), and ChatGPT-4o (OpenAI, 2024) and o4-mini (OpenAI, 2025). For all metrics, we run 10 replicate episodes with randomized chemistries. We collect a total of 15 trajectories for each agent type. The paper describes the number of replicates and trajectories for experiments but does not specify traditional train/test/validation dataset splits, as the environments are procedurally generated for each episode or trajectory. |
| Hardware Specification | No | We performed experiments using Gemini 1.5 Pro and Flash (Reid et al., 2024), Gemini 2.5 Pro and Flash (Google, 2025), Claude 3.7 Sonnet (Anthropic, 2025), and ChatGPT-4o (OpenAI, 2024) and o4-mini (OpenAI, 2025). The paper mentions using public APIs for the foundation models and a 3D simulation environment, but it does not specify the hardware (e.g., specific GPU/CPU models, memory) on which these experiments were run or the simulations were executed. |
| Software Dependencies | No | We performed experiments using Gemini 1.5 Pro and Flash (Reid et al., 2024), Gemini 2.5 Pro and Flash (Google, 2025), Claude 3.7 Sonnet (Anthropic, 2025), and ChatGPT-4o (OpenAI, 2024) and o4-mini (OpenAI, 2025). The paper lists the foundation models used but does not provide specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks) for their experimental setup. |
| Experiment Setup | Yes | We use the default settings for the public APIs in all cases unless noted otherwise. For Feature World, we compared ChatGPT-4o (Achiam et al., 2023; OpenAI, 2024) (200k context), Claude 3.7 Sonnet (Anthropic, 2025) (200k context), Gemini 1.5 Flash and Pro (Reid et al., 2024), and Gemini 2.5 Flash and Pro (Google, 2025) (1M context). For all experiments, we found 200k context to be sufficient. To evaluate information gathering efficiency, we assess how often models are successful at finding a rewarding object given a fixed budget of exploration steps. We set the step budget as the maximum number of steps that an optimal policy would need before finding at least one rewarding object. We measure model performance primarily through three metrics: 1) performance: mean score over the 10 trials of an episode, 2) improvement: difference of the mean score of the last 5 trials and the score of the first trial, and 3) adaptation: the mean score for 10 trials following an unexpected change in chemistry. We assess two variables impacting the ability of models to solve the Alchemy task: 1) inclusion of prior information on invariant principles of Alchemy in the prompt (see Section A.4.1 for details), and 2) use of summarization to augment model learning across trials (see Section A.4.2 for details). |
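The three Alchemy metrics quoted above (performance, improvement, adaptation) can be sketched in a few lines of Python. This is a minimal illustration of the stated definitions, not the authors' code; the function name and input format (plain lists of per-trial scores) are assumptions.

```python
from statistics import mean

def alchemy_metrics(trial_scores, post_change_scores):
    """Episode-level metrics as defined in the paper's setup.

    trial_scores: scores for the 10 trials of an episode.
    post_change_scores: scores for the 10 trials that follow an
    unexpected change in chemistry.
    """
    # performance: mean score over the 10 trials of the episode
    performance = mean(trial_scores)
    # improvement: mean of the last 5 trials minus the first trial's score
    improvement = mean(trial_scores[-5:]) - trial_scores[0]
    # adaptation: mean score over the 10 trials after the chemistry change
    adaptation = mean(post_change_scores)
    return performance, improvement, adaptation
```

For example, an episode whose scores rise steadily from 0 to 9 yields a performance of 4.5 and an improvement of 7.0 under these definitions.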