Robust Offline Reinforcement Learning with Linearly Structured $f$-Divergence Regularization
Authors: Cheng Tang, Zhishuai Liu, Pan Xu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, numerical experiments validate that R2PVI learns robust policies and exhibits superior computational efficiency compared to baseline methods. We conduct numerical experiments to explore (1) the robustness of R2PVI regarding dynamics shifts, (2) how the regularizer λ affects the robustness of R2PVI, and (3) the computation cost of R2PVI. We evaluate our algorithm in two off-dynamics problems. All experiments are conducted on a machine with an 11th Gen Intel(R) Core(TM) i5-11300H @ 3.10GHz processor, featuring 8 logical CPUs, 4 physical cores, and 2 threads per core. The implementation of our R2PVI algorithm is available at https://github.com/panxulab/Robust-Regularized-Pessimistic-Value-Iteration. All experiment results are shown in Figure 2. |
| Researcher Affiliation | Academia | ¹University of Illinois Urbana-Champaign (work was done while Cheng Tang was at Tsinghua University); ²Duke University. Correspondence to: Pan Xu <EMAIL>. |
| Pseudocode | Yes | We propose the meta-algorithm in Algorithm 1 ("R2PVI under general f-divergence"). Next, we instantiate the f-divergence with the TV, KL, and χ2 divergences respectively, and specify the estimation procedure corresponding to each divergence in Algorithm 2 ("R2PVI under TV, KL and χ2 divergence"). |
| Open Source Code | Yes | The implementation of our R2PVI algorithm is available at https://github.com/panxulab/Robust-Regularized-Pessimistic-Value-Iteration. |
| Open Datasets | No | We conduct experiments in simulated environments, including a linear MDP setting (Liu & Xu, 2024a) and the American Put Option environment (Tamar et al., 2014). We borrow the simulated linear MDP constructed in Liu & Xu (2024a) and adapt it to the offline setting. In this section, we test our algorithm in a simulated American Put Option environment (Tamar et al., 2014; Zhou et al., 2021) that does not belong to the d-rectangular linear RRMDP. We collect the offline data from the nominal environment by a uniformly random behavior policy. |
| Dataset Splits | No | The sample size of the offline dataset is set to 100. We collect the offline data from the nominal environment by a uniformly random behavior policy. This text describes the collection of offline data and its size but does not specify any training, testing, or validation splits. |
| Hardware Specification | Yes | All experiments are conducted on a machine with an 11th Gen Intel(R) Core(TM) i5-11300H @ 3.10GHz processor, featuring 8 logical CPUs, 4 physical cores, and 2 threads per core. |
| Software Dependencies | No | The paper does not explicitly state specific version numbers for software dependencies or libraries used in the implementation of the algorithms (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We set the behavior policy πb so that it chooses actions uniformly at random. The sample size of the offline dataset is set to 100. For completeness, we present more details on the experiment setup and results in Appendix A. The hyper-parameters in our setting are shown in Table 2. The horizon is 3; β, γ, and δ are set the same in all tasks; ξ1 is set to 0.3, 0.2, and 0.1 in Figure 1 to illustrate the versatility of our algorithms. (Table 2 includes: H (horizon) 3, β (pessimism parameter) 1, γ 0.1, δ 0.3, ξ1 0.3, 0.2, 0.1.) We set p0 = 0.5 in the nominal environment. β = 0.1 and γ = 1 are the hyper-parameters set in all tasks. |
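The quoted data-collection setup (a uniformly random behavior policy gathering an offline dataset from the nominal environment over a short horizon) can be sketched as follows. This is a minimal illustration only: the episodic interface (`step_fn`, `init_state`) and the reading of the sample size as a number of trajectories are assumptions, not part of the released R2PVI implementation.

```python
import random

def collect_offline_data(n_trajectories, horizon, n_actions, step_fn, init_state):
    """Collect offline transitions with a uniformly random behavior policy pi_b.

    step_fn(h, s, a) -> (next_state, reward) and init_state() -> state are
    hypothetical hooks standing in for the nominal environment.
    """
    dataset = []
    for _ in range(n_trajectories):
        s = init_state()
        for h in range(horizon):
            a = random.randrange(n_actions)  # pi_b: uniform over actions
            s_next, r = step_fn(h, s, a)
            dataset.append((h, s, a, r, s_next))
            s = s_next
    return dataset

# Toy usage mirroring the reported setting: horizon H = 3, 100 samples.
data = collect_offline_data(
    n_trajectories=100,
    horizon=3,
    n_actions=2,
    step_fn=lambda h, s, a: (s + a, 1.0),  # placeholder dynamics/reward
    init_state=lambda: 0,
)
```

The resulting `data` is a flat list of `(h, s, a, r, s_next)` tuples, one per step, which is the form an offline value-iteration method like R2PVI would consume.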