Reset-free Reinforcement Learning with World Models

Authors: Zhao Yang, Thomas M. Moerland, Mike Preuss, Aske Plaat, Edward S. Hu

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate three MBRL methods (PEG (Hu et al., 2023), its reset-free extension reset-free PEG, and our proposed method MoReFree) and four competitive reset-free baselines on eight reset-free tasks. We aim to address the following questions: 1) Do MBRL approaches work well on reset-free tasks in terms of sample efficiency and performance? 2) What limitations arise from running MBRL in the reset-free setting, and does our proposed solution MoReFree address them? 3) What sorts of behavior do MoReFree and the baselines exhibit in such tasks, and are our design choices for MoReFree justified?
Researcher Affiliation | Academia | Zhao Yang (EMAIL), The Leiden Institute of Advanced Computer Science, Leiden University; Thomas M. Moerland, The Leiden Institute of Advanced Computer Science, Leiden University; Mike Preuss, The Leiden Institute of Advanced Computer Science, Leiden University; Aske Plaat, The Leiden Institute of Advanced Computer Science, Leiden University; Edward S. Hu, GRASP Lab, University of Pennsylvania
Pseudocode | Yes | Algorithm 1 Go-Explore
    Input: g, π^G_θ, π^E_θ
    τ_g ← {}; τ_e ← {}
    for t = 1 to H^G do
        a_t ∼ π^G_θ(·|s_t, g)
        s_{t+1} ∼ T(·|s_t, a_t)
        τ_g ← τ_g ∪ {s_t}
    end for
    for t = 1 to H^E do
        a_t ∼ π^E_θ(·|s_t)
        s_{t+1} ∼ T(·|s_t, a_t)
        τ_e ← τ_e ∪ {s_t}
    end for
    return τ_g, τ_e
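Algorithm 1's two-phase rollout can be sketched in Python. This is a minimal sketch, not the paper's implementation: the environment interface (`env.step`), the goal-conditioned Go-policy `pi_g`, the Explore-policy `pi_e`, and the horizons `H_g`/`H_e` are all hypothetical stand-ins for the symbols in the pseudocode.

```python
def go_explore_episode(env, s0, g, pi_g, pi_e, H_g, H_e):
    """Collect one Go-Explore episode: a Go-phase trajectory tau_g
    followed by an Explore-phase trajectory tau_e (Algorithm 1)."""
    tau_g, tau_e = [], []
    s = s0
    # Go-phase: follow the goal-conditioned policy toward g for H_g steps.
    for _ in range(H_g):
        a = pi_g(s, g)       # a_t ~ pi^G_theta(.|s_t, g)
        s = env.step(a)      # s_{t+1} ~ T(.|s_t, a_t)
        tau_g.append(s)
    # Explore-phase: switch to the exploration policy for H_e steps.
    for _ in range(H_e):
        a = pi_e(s)          # a_t ~ pi^E_theta(.|s_t)
        s = env.step(a)
        tau_e.append(s)
    return tau_g, tau_e
```

Both trajectories are returned so the Go-phase and Explore-phase data can be stored in the replay buffer separately, mirroring the `return τ_g, τ_e` line of the pseudocode.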
Open Source Code | Yes | Website: https://yangzhao-666.github.io/morefree
Open Datasets | Yes | We evaluate MoReFree and the baselines on eight tasks (see Figure 3). We select five of the six tasks from IBC's evaluation suite (Kim et al., 2023): Point UMaze, Tabletop, Sawyer Door, Fetch Push, and Fetch Pick&Place (Fetch Reach is omitted because it is trivially solvable). Next, we increased the complexity of the two hardest tasks from IBC, Fetch Push and Fetch Pick&Place, by extending the size of the workspace, replacing artificial workspace limits (which cause unrealistic jittering behavior near the limits; see the website for videos) with real walls, and evaluating on harder goal states (i.e., Pick&Place goals only in the air rather than including ones on the ground). In addition, we contributed a difficult locomotion task, Ant, which is adapted from the PEG codebase (Hu et al., 2023).
Dataset Splits | No | The paper describes goal and initial-state distributions (e.g., ρ0 and ρg) for training and evaluation within a dynamic RL setting. For instance, 'The evaluation of agents is still episodic. The agent always starts from s0 ∼ ρ0, and is asked to achieve g ∼ ρg.' and 'First, we choose to sample evaluation goals from ρg [...] Next, [...] also samples initial states from ρ0 as goals for the Go-phase to emulate resetting behavior.' However, it does not specify static training, validation, or test dataset splits in the conventional sense of pre-partitioned data. Data is collected dynamically through interaction with the environment.
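The quoted goal-sampling scheme can be sketched as follows. The samplers `rho_0` and `rho_g` below are hypothetical toy stand-ins for the paper's initial-state and goal distributions; the point is only that the Go-phase may target either an evaluation goal (g ∼ ρg) or an initial state (s0 ∼ ρ0, emulating a reset).

```python
import random

def rho_0():
    """Hypothetical initial-state distribution: a fixed start state."""
    return (0.0, 0.0)

def rho_g():
    """Hypothetical goal distribution: uniform over a unit square."""
    return (random.random(), random.random())

def go_phase_goal(emulate_reset):
    # During reset-free training, the Go-phase either pursues an
    # evaluation goal g ~ rho_g, or an initial state s0 ~ rho_0 to
    # emulate resetting behavior (per the quoted passage).
    return rho_0() if emulate_reset else rho_g()
```

Evaluation, by contrast, is episodic: the agent always starts from s0 ∼ ρ0 and is asked to reach g ∼ ρg, so no static train/validation/test split of a fixed dataset exists.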
Hardware Specification | Yes | We submit jobs on a cluster with Nvidia 2080, 3090, and A100 GPUs.
Software Dependencies | No | The paper states: 'Our agent is built on the model-based go-explore method PEG (Hu et al., 2023), we extend their codebase...' and 'The RND implementation we follow is from DI-engine'. While it references existing codebases and a specific implementation, it does not provide a list of specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) for the authors' own experimental setup.
Experiment Setup | Yes | Train ratio (i.e., update-to-data ratio) is an important hyperparameter in MBRL. It controls how frequently the agent is trained: every n environment steps, a batch of data is sampled from the replay buffer, the world model is trained on the batch, and then the policies and value functions are trained in imagination. In all our experiments, we only vary n across tasks. See the table below for the values used for each task throughout the experiments. MoReFree also introduces a new parameter α, which we keep at α = 0.2 for all tasks and did not tune at all. All other hyperparameters we keep the same as in the original codebase.
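The update-to-data schedule described above can be sketched generically. `step_env` and `update` are hypothetical callables standing in for one environment interaction and one update block (world-model training plus policy/value training in imagination); only the ratio logic is shown.

```python
def train_loop(step_env, update, total_steps, n):
    """Sketch of the train-ratio schedule: every n environment steps,
    run one update block. Returns (env_steps_taken, updates_run)."""
    updates = 0
    for t in range(1, total_steps + 1):
        step_env()            # collect one transition into the buffer
        if t % n == 0:        # train ratio: one update block per n steps
            update()          # sample batch, train world model,
            updates += 1      # then train policy/value in imagination
    return total_steps, updates
```

A smaller n means more gradient updates per collected transition (higher update-to-data ratio); the report notes that only n is varied across tasks while α = 0.2 is held fixed.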