Sample Complexity of Distributionally Robust Off-Dynamics Reinforcement Learning with Online Interaction

Authors: Yiting He, Zhishuai Liu, Weixin Wang, Pan Xu

ICML 2025

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | "Finally, we validate our theoretical results through comprehensive numerical experiments." ... "We conduct comprehensive numerical experiments to validate our theoretical findings. In a simulated MDP, we show that the performance of learned policies degrades as Cvr increases. We evaluate our algorithms in a simulated RMDP and the Frozen Lake environment, highlighting their effectiveness when distribution shifts are significant."

Researcher Affiliation | Academia | "Duke University. Correspondence to: Pan Xu <EMAIL>."

Pseudocode | Yes | "Algorithm 1 Online Robust Bellman Iteration (ORBIT)" ... "Algorithm 2 A more efficient solver for the CRMDP-TV Setting"

Open Source Code | Yes | "The implementation of our ORBIT algorithm is available at https://github.com/panxulab/Online-Robust-Bellman-Iteration."

Open Datasets | Yes | "Now we test our algorithm in a hard-to-explore setting, the Frozen Lake problem." ... "We use the default map in the Open AI Gym library, which is illustrated in Example A.1."

Dataset Splits | No | The paper describes online interaction with environments (simulated MDPs, Frozen Lake) for K episodes and evaluates the learned policies in target environments with different perturbation rates. However, it does not provide training/test/validation dataset splits in the traditional sense, as data is generated dynamically through interaction.

Hardware Specification | Yes | "All numerical experiments were conducted on a server equipped with Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz."

Software Dependencies | No | The paper mentions using the 'Open AI Gym library' but does not specify a version number for it or for any other key software dependency.

Experiment Setup | Yes | "We set H = 25 and K = 1,000 in Algorithm 1. The hyperparameter ρ in the constrained setting, β in the regularized setting, and c_bonus are tuned from {0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1}, with the final choice presented in Table 2." ... "Table 2. Hyperparameters for Section 6.2 (Learning on Simulated RMDPs)" ... "Table 3. Hyperparameters for Section 6.3 (Learning the Frozen Lake Problem)"
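The Open Datasets row cites "the default map in the Open AI Gym library." For reference, a minimal sketch of that layout, assuming the standard 4x4 default that ships with Gym's FrozenLake-v1 (the paper's Example A.1 is not reproduced here):

```python
# Default 4x4 map shipped with OpenAI Gym's FrozenLake-v1
# (assumed to be the "default map" the paper refers to;
# S = start, F = frozen surface, H = hole, G = goal).
DEFAULT_MAP_4X4 = [
    "SFFF",
    "FHFH",
    "FFFH",
    "HFFG",
]

def hole_positions(desc):
    """Return the (row, col) coordinates of every hole in a map description."""
    return [(r, c)
            for r, row in enumerate(desc)
            for c, ch in enumerate(row)
            if ch == "H"]
```

The four holes at (1, 1), (1, 3), (2, 3), and (3, 0) are what make this a hard-to-explore environment: a uniformly random policy rarely reaches the goal.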
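The tuning procedure in the Experiment Setup row (each hyperparameter selected from a fixed seven-value grid) amounts to a simple grid search. A sketch of that loop follows; `evaluate` is a hypothetical scoring callback standing in for a training-and-evaluation run, not a function from the paper's codebase:

```python
# Grid reported in the paper for rho, beta, and c_bonus.
GRID = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1]

def grid_search(evaluate, grid=GRID):
    """Return the grid value that maximizes the (hypothetical) evaluate callback."""
    best_value, best_score = None, float("-inf")
    for candidate in grid:
        score = evaluate(candidate)
        if score > best_score:
            best_value, best_score = candidate, score
    return best_value
```

For example, `grid_search(lambda v: -abs(v - 0.03))` returns 0.03, the grid point closest to the optimum of that toy objective.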