Behavioral Exploration: Learning to Explore via In-Context Adaptation
Authors: Andrew Wagenmaker, Zhiyuan Zhou, Sergey Levine
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our method in both simulated locomotion and manipulation settings, as well as on real-world robotic manipulation tasks, illustrating its ability to learn adaptive, exploratory behavior. In our experimental evaluation, our focus is on understanding (a) whether BE is able to learn effective exploration strategies from offline demonstration data and adapt quickly online, (b) if BE is able to effectively focus its exploration over the space of behaviors present in the demonstration data, and (c) if BE scales to large-scale, real-world imitation learning (IL) settings. We first focus on RL benchmarks, where we compare against RL-based approaches to exploration, and then on IL, where we consider both simulated and real-world robotic tasks. |
| Researcher Affiliation | Academia | 1Department of Electrical Engineering & Computer Science, University of California, Berkeley. Correspondence to: Andrew Wagenmaker <EMAIL>. |
| Pseudocode | No | The paper describes mathematical propositions (Proposition 4.2, Proposition A.1) and outlines an objective function (Equation 4) for Behavioral Exploration. It describes methods in paragraph text and illustrates with diagrams (Figure 1), but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: "For all experiments, for both BE and BC, we use the diffusion policy architecture proposed by Dasari et al. (2024) and utilize their code base as the starting point for our method." This indicates they used third-party code, but there is no explicit statement or link provided for their own implementation of Behavioral Exploration. |
| Open Datasets | Yes | For our RL experiments, we evaluate BE on a subset of the environments in the D4RL benchmark (Fu et al., 2020), focusing in particular on settings that require exploration. In simulation, we utilize the Libero benchmark (Liu et al., 2024), which simulates a variety of robotic manipulation and pick-and-place tasks, while in the real world, we train a policy for object manipulation on the Bridge dataset (Walke et al., 2023). |
| Dataset Splits | Yes | For Antmaze, we evaluate on the medium and large variants of the maze using the diverse offline dataset, and for each test with four distinct goal locations... For Kitchen, we utilize the partial variant of the offline data. We run all experiments on the Libero 90 dataset, which includes 90 tasks spread across 21 distinct scenes. For each task, the dataset provides 50 human demonstrations of successful completion... A trial consists of 5 consecutive episodes in the same scene. |
| Hardware Specification | Yes | This research used the Savio computational cluster resource provided by the Berkeley Research Computing program at UC Berkeley. |
| Software Dependencies | No | The paper states: "For all experiments, for both BE and BC, we use the diffusion policy architecture proposed by Dasari et al. (2024) and utilize their code base as the starting point for our method." This mentions a specific architecture but does not provide specific version numbers for software components (e.g., Python, PyTorch, specific library versions). |
| Experiment Setup | Yes | Table 1: Common hyperparameters for all BE and BC experiments — learning rate: 3e-4; LR scheduler: cosine; warmup steps: 2000. Table 3: Hyperparameters for D4RL BE experiments. Table 4: Hyperparameters for D4RL BC experiments. Table 7: Hyperparameters for Libero BE and BC models. Table 9: Hyperparameters for Widow X BE and BC models. |
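The common training hyperparameters reported in Table 1 (learning rate 3e-4, cosine LR scheduler, 2000 warmup steps) can be sketched as a schedule function. This is a minimal illustration, not the paper's implementation: the linear-warmup shape, the decay-to-zero floor, and the `total_steps` parameter are assumptions, since the tables give only the values above.

```python
import math

# Hyperparameters from Table 1 of the paper.
BASE_LR = 3e-4
WARMUP_STEPS = 2000

def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a given step: linear warmup to BASE_LR over
    WARMUP_STEPS, then cosine decay toward zero (assumed floor)."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 up to the base learning rate.
        return BASE_LR * step / WARMUP_STEPS
    # Fraction of the post-warmup schedule completed, in [0, 1].
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    # Half-cosine from BASE_LR down to 0.
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In practice this shape matches what frameworks such as PyTorch provide via a warmup schedule chained with cosine annealing; the standalone function above just makes the reported values concrete.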