Sample Complexity of Distributionally Robust Off-Dynamics Reinforcement Learning with Online Interaction
Authors: Yiting He, Zhishuai Liu, Weixin Wang, Pan Xu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we validate our theoretical results through comprehensive numerical experiments. ... We conduct comprehensive numerical experiments to validate our theoretical findings. In a simulated MDP, we show that the performance of learned policies degrades as Cvr increases. We evaluate our algorithms in a simulated RMDP and the Frozen Lake environment, highlighting their effectiveness when distribution shifts are significant. |
| Researcher Affiliation | Academia | 1Duke University. Correspondence to: Pan Xu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Online Robust Bellman Iteration (ORBIT) ... Algorithm 2 A more efficient solver for the CRMDP-TV Setting |
| Open Source Code | Yes | The implementation of our ORBIT algorithm is available at https://github.com/panxulab/Online-Robust-Bellman-Iteration. |
| Open Datasets | Yes | Now we test our algorithm in a hard-to-explore setting, the Frozen Lake problem. ... We use the default map in the Open AI Gym library, which is illustrated in Example A.1 |
| Dataset Splits | No | The paper describes online interaction with environments (simulated MDPs, Frozen Lake) for K episodes and evaluates the learned policies in target environments with different perturbation rates. However, it does not provide specific training/test/validation dataset splits in the traditional sense, as data is generated dynamically through interaction. |
| Hardware Specification | Yes | All numerical experiments were conducted on a server equipped with Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz. |
| Software Dependencies | No | The paper mentions using the 'Open AI Gym library' but does not specify a version number for it or any other key software dependencies. |
| Experiment Setup | Yes | We set H = 25 and K = 1, 000 in Algorithm 1. The hyperparameter ρ in the constrained setting, β in the regularized setting, and cbonus are tuned from {0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1}, with the final choice presented in Table 2. ... Table 2. hyper-parameters for Section 6.2 (Learning on Simulated RMDPs) ... Table 3. hyper-parameters for Section 6.3 (Learning the Frozen Lake Problem) |
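The setup row above describes tuning ρ, β, and cbonus over a shared grid of seven values. A minimal sketch of that grid search is below; the `evaluate` function is a hypothetical placeholder (a real run would train ORBIT for K = 1,000 episodes with horizon H = 25 and score the learned policy), and all names here are illustrative, not from the paper's released code.

```python
import itertools

# Grid reported in the paper for rho (constrained setting),
# beta (regularized setting), and c_bonus.
GRID = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1]

def evaluate(rho, c_bonus):
    # Hypothetical stand-in objective: a real implementation would
    # train the algorithm with these hyperparameters and return the
    # policy's robust value in the target environment.
    return -(abs(rho - 0.03) + abs(c_bonus - 0.1))

# Exhaustively search the 7 x 7 grid and keep the best-scoring pair.
best = max(itertools.product(GRID, GRID), key=lambda p: evaluate(*p))
print(best)
```

With this placeholder objective the search selects `(0.03, 0.1)`; in practice the chosen values are the ones reported in the paper's Tables 2 and 3.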