Online Reinforcement Learning in Non-Stationary Context-Driven Environments

Authors: Pouya Hamadanian, Arash Nasr-Esfahany, Malte Schwarzkopf, Siddhartha Sen, Mohammad Alizadeh

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. We evaluate LCPO in Mujoco, classic control and computer systems environments with a variety of synthetic and real context traces, and find that it outperforms a variety of baselines in the non-stationary setting...
Researcher Affiliation: Collaboration. Pouya Hamadanian (MIT CSAIL), Arash Nasr-Esfahany (MIT CSAIL), Malte Schwarzkopf (Brown University), Siddhartha Sen (Microsoft Research), Mohammad Alizadeh (MIT CSAIL).
Pseudocode: Yes.
Algorithm 1 LCPO Training
1:  initialize parameter vector θ0, empty buffer Ba
2:  for each iteration do
3:      Br ← sample a mini-batch of new interactions
4:      Sc ← W(Ba, Br)
5:      v ← ∇θ Ltot(θ; Br)|θ0
6:      if Sc is not empty then
7:          g(x) := ∇θ(xᵀ ∇θ DKL(θold, θ; Sc)|θ0)|θ0
8:          vc ← conjgrad(v, g(·))
9:          while θold + vc violates constraints do
10:             vc ← vc/2
11:         θ0 ← θ0 + vc
12:     else
13:         θ0 ← θ0 + v
14:     Ba ← Ba ∪ Br
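Steps 8–10 of the pseudocode are a TRPO-style natural-gradient update: solve the linear system defined by the KL Hessian-vector product with conjugate gradients, then halve the step until the trust-region constraint holds. A minimal NumPy sketch of just that numerical core; `constrained_step`, `violates`, and `max_halvings` are illustrative names of ours, not the authors' code:

```python
import numpy as np

def conjgrad(v, g, iters=10, tol=1e-10):
    """Approximately solve g(x) = v for x via conjugate gradients.

    `g` computes a Hessian-vector product (step 7 of Algorithm 1)
    and is assumed symmetric positive definite.
    """
    x = np.zeros_like(v)
    r = v.copy()          # residual v - g(x), with x = 0
    p = r.copy()
    rr = r @ r
    for _ in range(iters):
        gp = g(p)
        alpha = rr / (p @ gp)
        x += alpha * p
        r -= alpha * gp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def constrained_step(theta, v, g, violates, max_halvings=10):
    """Steps 8-11: compute the update direction, then halve it until
    the constraint check passes (bounded here for safety)."""
    vc = conjgrad(v, g)
    for _ in range(max_halvings):
        if not violates(theta + vc):
            break
        vc = vc / 2.0
    return theta + vc
```

In the paper's notation, `violates` would check the KL trust-region constraint on `Sc`; here it is an arbitrary predicate so the step-halving logic can be tested in isolation.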
Open Source Code: Yes. LCPO's source code is available at https://github.com/pouyahmdn/LCPO.
Open Datasets: Yes. We consider six environments: modified versions of Pendulum-v1 from the classic control environments; Inverted Pendulum-v4, Inverted Double Pendulum-v4, Hopper-v4 and Reacher-v4 from the Mujoco environments (Towers et al., 2023); and a straggler mitigation environment (Hamadanian et al., 2022).
Dataset Splits: No. The paper describes experimental procedures and data generation (e.g., 'warm-up period of 6 million time steps', 'Context traces 1 and 2 are 20 million...'), but does not specify the explicit training/validation/test splits typical of static-dataset evaluations. In online reinforcement learning, data is generated sequentially rather than pre-split.
Hardware Specification: Yes. These experiments were conducted on a machine with 2 AMD EPYC 7763 CPUs (256 logical cores) and 512 GiB of RAM. With 32 concurrent runs, experiments finished in 1152 hours.
Software Dependencies: Yes. We use Gymnasium (v0.29.1, MIT license) and Mujoco (v3.1.1, Apache-2.0 license). Our baseline and LCPO implementations use the PyTorch (Paszke et al., 2019) (v1.13.1, BSD-style license) library.
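The pinned versions above suggest an environment that could plausibly be recreated with pip; the exact package names and any further pins are assumptions on our part (the repository's own requirements file, if present, is authoritative):

```shell
pip install gymnasium==0.29.1 mujoco==3.1.1 torch==1.13.1
```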
Experiment Setup: Yes. Table 11 is a comprehensive list of all hyperparameters used in training and the environment.