Online Reinforcement Learning in Non-Stationary Context-Driven Environments
Authors: Pouya Hamadanian, Arash Nasr-Esfahany, Malte Schwarzkopf, Siddhartha Sen, Mohammad Alizadeh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate LCPO in Mujoco, classic control and computer systems environments with a variety of synthetic and real context traces, and find that it outperforms a variety of baselines in the non-stationary setting... |
| Researcher Affiliation | Collaboration | Pouya Hamadanian (MIT CSAIL), Arash Nasr-Esfahany (MIT CSAIL), Malte Schwarzkopf (Brown University CS), Siddhartha Sen (Microsoft Research), Mohammad Alizadeh (MIT CSAIL) |
| Pseudocode | Yes | Algorithm 1 LCPO Training — 1: initialize parameter vector θ0, empty buffer Ba; 2: for each iteration do; 3: Br ← sample a mini-batch of new interactions; 4: Sc ← W(Ba, Br); 5: v ← ∇θ Ltot(θ; Br) at θ0; 6: if Sc is not empty then; 7: g(x) := ∇θ(xᵀ ∇θ DKL(θold, θ; Sc) at θ0) at θ0; 8: vc ← conjgrad(v, g(·)); 9: while θold + vc violates constraints do; 10: vc ← vc/2; 11: θ0 ← θ0 + vc; 12: else; 13: θ0 ← θ0 + v; 14: Ba ← Ba + Br |
| Open Source Code | Yes | LCPO's source code is available at https://github.com/pouyahmdn/LCPO. |
| Open Datasets | Yes | We consider six environments: Modified versions of Pendulum-v1 from the classic control environments, Inverted Pendulum-v4, Inverted Double Pendulum-v4, Hopper-v4 and Reacher-v4 from the Mujoco environments (Towers et al., 2023), and a straggler mitigation environment (Hamadanian et al., 2022). |
| Dataset Splits | No | The paper describes experimental procedures and data generation (e.g., 'warm-up period of 6 million time steps', 'Context traces 1 and 2 are 20 million...'), but does not specify explicit training/test/validation dataset splits typically found in static dataset evaluations. In online reinforcement learning, data is generated sequentially, rather than being pre-split. |
| Hardware Specification | Yes | These experiments were conducted on a machine with 2 AMD EPYC 7763 CPUs (256 logical cores) and 512 GiB of RAM. With 32 concurrent runs, experiments finished in 1152 hours. |
| Software Dependencies | Yes | We use Gymnasium (v0.29.1, MIT license) and Mujoco (v3.1.1, Apache-2.0 license). Our baseline and LCPO implementations use the Pytorch (Paszke et al., 2019) (v1.13.1, BSD-style license) library. |
| Experiment Setup | Yes | Table 11 is a comprehensive list of all hyperparameters used in training and the environment. |
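The LCPO update in Algorithm 1 combines a conjugate-gradient solve (to turn the loss gradient into a KL-constrained direction) with step halving until the trust-region constraint holds. The following is a minimal runnable sketch of that structure in NumPy; the function names (`conj_grad`, `lcpo_step`, `kl_fn`) and the scalar KL bound are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def conj_grad(hvp, b, iters=10, tol=1e-8):
    """Solve H x = b given only Hessian-vector products hvp(p) = H p
    (Algorithm 1, line 8: vc <- conjgrad(v, g(.)))."""
    x = np.zeros_like(b)
    r = b.copy()          # residual b - H x (x starts at 0)
    p = r.copy()          # search direction
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def lcpo_step(theta, grad_loss, hvp, kl_fn, anchor_nonempty,
              max_kl=0.01, max_halvings=10):
    """One LCPO-style update (Algorithm 1, lines 5-13): take a
    KL-constrained step when anchor samples Sc exist, otherwise a
    plain gradient step."""
    v = grad_loss(theta)                     # line 5
    if anchor_nonempty:                      # line 6
        vc = conj_grad(hvp, v)               # line 8
        for _ in range(max_halvings):        # lines 9-10: backtrack
            if kl_fn(theta, theta + vc) <= max_kl:
                break
            vc = vc / 2.0
        return theta + vc                    # line 11
    return theta + v                         # line 13
```

On a toy quadratic problem (diagonal Hessian, squared-distance stand-in for the KL term), the halving loop shrinks the conjugate-gradient direction until the constraint is met, mirroring lines 9–10 of the pseudocode.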