Continual Reinforcement Learning by Planning with Online World Models

Authors: Zichen Liu, Guoji Fu, Chao Du, Wee Sun Lee, Min Lin

ICML 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | To assess OA, we further design Continual Bench, a dedicated environment for CRL, and compare with several strong baselines under the same model-planning algorithmic framework. The empirical results show that OA learns continuously to solve new tasks while not forgetting old skills, outperforming agents built on deep world models with various continual learning techniques.
Researcher Affiliation | Collaboration | ¹Sea AI Lab, ²National University of Singapore. Correspondence to: Min Lin <EMAIL>.
Pseudocode | Yes | Figure 7: OA learning and acting loop.
Require: zero-initialized agent memories A^(0), B^(0) and world model weights W^(0); sparse encoder ϕ: R^(S+A) → R^D; initial index s = [D]; planner CEM; sequence of tasks (R_τ)_{τ=1}; initial state s_1 ∼ ρ_0; time step t = 1.
1: loop  ▷ OA runs forever, updates per step
2:   if task changes then
3:     µ_t ← init values ∈ R^(A×H)
4:   else
5:     µ_t ← shifted µ_{t−1} (fill the last column with init values)
6:   end if
7:   W_s^(t) ← W_s^(t−1) + (A_ss^(t−1) + (1/λ)I)^(−1) (B_s^(t−1) − A_ss^(t−1) W_s^(t−1))
8:   a_t, µ_{t+1} ← CEM(s_t, W^(t), µ_t, R_τ)  ▷ see Appendix A.4 for details
9:   s_{t+1} ← environment(s_t, a_t)
10:  x_t ← [s_t, a_t]; y_t ← s_{t+1} − s_t
11:  s ← nonzero_index(ϕ(x_t))
12:  A_ss^(t) ← A_ss^(t−1) + ϕ_s(x_t) ϕ_s(x_t)^T
13:  B_s^(t) ← B_s^(t−1) + ϕ_s(x_t) y_t^T
14:  t ← t + 1
15: end loop
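The model update in the loop above amounts to a regularized online least-squares fit touched only on the coordinates activated by the sparse encoder. Below is a minimal NumPy sketch of that update under dense storage; the function name `ftl_update`, the dense `A`/`B` matrices, and the argument layout are illustrative assumptions, not the authors' interface:

```python
import numpy as np

def ftl_update(A, B, W, phi_x, y, inv_lam):
    """One online update of a sparse linear world model (illustrative).

    A: (D, D) accumulated feature second moments, sum of phi phi^T
    B: (D, S) accumulated feature-target products, sum of phi y^T
    W: (D, S) model weights mapping features to state deltas
    phi_x: (D,) sparse feature vector of the input x_t = [s_t, a_t]
    y: (S,) regression target y_t = s_{t+1} - s_t
    inv_lam: regularization strength 1/lambda
    """
    s = np.flatnonzero(phi_x)                 # active coordinates only
    A_ss = A[np.ix_(s, s)]
    # residual-correction solve of the regularized least squares,
    # restricted to the active block (cf. line 7 of the loop)
    reg = A_ss + inv_lam * np.eye(len(s))
    W[s] += np.linalg.solve(reg, B[s] - A_ss @ W[s])
    # accumulate sufficient statistics (cf. lines 12-13 of the loop)
    A[np.ix_(s, s)] = A_ss + np.outer(phi_x[s], phi_x[s])
    B[s] += np.outer(phi_x[s], y)
    return A, B, W
```

Because only the rows and columns indexed by the nonzero features are read or written, the per-step cost scales with the number of active features rather than with the full dictionary size D.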
Open Source Code | Yes | We open source the code of Continual Bench [1] and hope this realistic but lightweight environment can accelerate the progress of CRL research. Please see Appendix B for detailed environment specifications. [1] https://github.com/sail-sg/Continual Bench
Open Datasets | Yes | In this work, we propose a lightweight but realistic benchmark environment, Continual Bench, that has a consistent state space and takes both forgetting and transfer into consideration. We open source the code of Continual Bench [1] and hope this realistic but lightweight environment can accelerate the progress of CRL research. [1] https://github.com/sail-sg/Continual Bench
Dataset Splits | No | The paper describes a continual reinforcement learning setting where an agent continuously solves a sequence of tasks. It mentions a fixed order of 6 tasks: (pick-place, button-press, door-open, peg-unplug, window-close, faucet-close). However, it does not specify traditional dataset splits (e.g., train/test/validation percentages or counts) for a static dataset, as is common in supervised learning. Instead, the tasks are presented sequentially for continuous learning and evaluation.
Hardware Specification | Yes | All experiments in the paper are run on the internal cluster, with each job consuming one A100 GPU and 16 CPUs.
Software Dependencies | Yes | We develop our agent and the baselines using the MBRL Library [2] (Pineda et al., 2021) by Meta (MIT license, v0.2.0). [2] https://github.com/facebookresearch/mbrl-lib
Experiment Setup | Yes | To build the sparse world model, we use 300 2-dimensional Losse features with Λ = 9. We find 1/λ = 0.005 to be a good regularization strength without any tuning. For the planner, we use a candidate sampling size of N = 150 with planning horizon H = 15 and K = 3 iterations. The ratio of elite candidates is 0.1. For all the above baselines, we use a 4-layer MLP with hidden size 200 and ReLU activation for world modeling. We use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 4e-4 and a minibatch size of 256. ... the deep models are trained until convergence over all available data (stopped when the validation loss on 5% holdout data does not improve over 5 consecutive epochs).
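The planner hyperparameters quoted above (N = 150 candidates, horizon H = 15, K = 3 iterations, 10% elites) can be illustrated with a generic cross-entropy-method planner. This is a sketch under assumptions: `cem_plan`, the `dynamics`/`reward` callables, and the diagonal-Gaussian action distribution are illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def cem_plan(s0, dynamics, reward, mu, sigma,
             n_candidates=150, n_iters=3, elite_frac=0.1, rng=None):
    """Cross-entropy method planning sketch.

    dynamics(s, a) -> batch of next states; reward(s, a) -> batch of rewards.
    mu, sigma: (H, A) mean/std of the Gaussian over action sequences.
    Returns the first action of the refined mean and the full mean.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_elites = max(1, int(elite_frac * n_candidates))
    H, A = mu.shape
    for _ in range(n_iters):
        # sample N candidate action sequences around the current mean
        acts = mu + sigma * rng.standard_normal((n_candidates, H, A))
        returns = np.zeros(n_candidates)
        s = np.repeat(s0[None], n_candidates, axis=0)
        for h in range(H):                     # roll candidates through the model
            returns += reward(s, acts[:, h])
            s = dynamics(s, acts[:, h])
        # refit the sampling distribution to the top elite_frac candidates
        elites = acts[np.argsort(returns)[-n_elites:]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0)
    return mu[0], mu
```

Shifting the returned mean forward by one step and reusing it as the next call's initialization matches the warm-start behavior described in lines 2-6 of the Figure 7 loop.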