Robust Offline Reinforcement Learning with Linearly Structured $f$-Divergence Regularization

Authors: Cheng Tang, Zhishuai Liu, Pan Xu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, numerical experiments validate that R2PVI learns robust policies and exhibits superior computational efficiency compared to baseline methods. We conduct numerical experiments to explore (1) the robustness of R2PVI regarding dynamics shifts, (2) how the regularizer λ affects the robustness of R2PVI, and (3) the computation cost of R2PVI. We evaluate our algorithm in two off-dynamics problems. All experiment results are shown in Figure 2."
Researcher Affiliation | Academia | "1University of Illinois Urbana-Champaign (work was done while Cheng Tang was at Tsinghua University); 2Duke University. Correspondence to: Pan Xu <EMAIL>."
Pseudocode | Yes | "We propose the meta-algorithm in Algorithm 1 (R2PVI under general f-divergence). Next, we instantiate the f-divergence with TV, KL, and χ2 divergences respectively, and specify the estimation procedure corresponding to the different divergences (Algorithm 2: R2PVI under TV, KL, and χ2 divergence)."
Open Source Code | Yes | "The implementation of our R2PVI algorithm is available at https://github.com/panxulab/Robust-Regularized-Pessimistic-Value-Iteration."
Open Datasets | No | "We conduct experiments in simulated environments, including a linear MDP setting (Liu & Xu, 2024a) and the American Put Option environment (Tamar et al., 2014). We borrow the simulated linear MDP constructed in Liu & Xu (2024a) and adapt it to the offline setting. In this section, we test our algorithm in a simulated American Put Option environment (Tamar et al., 2014; Zhou et al., 2021) that does not belong to the d-rectangular linear RRMDP. We collect the offline data from the nominal environment by a uniformly random behavior policy."
Dataset Splits | No | "The sample size of the offline dataset is set to 100. We collect the offline data from the nominal environment by a uniformly random behavior policy." This text describes the collection and size of the offline data but does not specify any training, validation, or test splits.
Hardware Specification | Yes | "All experiments are conducted on a machine with an 11th Gen Intel(R) Core(TM) i5-11300H @ 3.10GHz processor, featuring 8 logical CPUs, 4 physical cores, and 2 threads per core."
Software Dependencies | No | The paper does not state version numbers for the software dependencies or libraries used in the implementation (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | "We set the behavior policy πb such that it chooses actions uniformly at random. The sample size of the offline dataset is set to 100. For completeness, we present more details on the experiment setup and results in Appendix A. The hyper-parameters in our setting are shown in Table 2. The horizon is 3; β, γ, and δ are set the same in all tasks; ξ1 is set to 0.3, 0.2, and 0.1 in Figure 1 to illustrate the versatility of our algorithms." (Table 2: H (horizon) = 3; β (pessimism parameter) = 1; γ = 0.1; δ = 0.3; ξ1 ∈ {0.3, 0.2, 0.1}.) "We set p0 = 0.5 in the nominal environment. β = 0.1 and γ = 1 are the hyper-parameters used in all tasks."
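The pseudocode rows above refer to R2PVI instantiated under TV, KL, and χ2 divergences. As an illustration only, and not the authors' implementation, the KL case admits a well-known closed form, inf_P {E_P[V] + λ·KL(P‖P0)} = −λ·log E_{P0}[exp(−V/λ)], which makes the robust regularized backup easy to sketch in a tabular finite-horizon setting. The pessimism penalty β/√n(s,a) below is a hypothetical count-based choice standing in for the paper's bonus; all names and shapes are illustrative.

```python
import numpy as np

def kl_robust_backup(v_next, p0, lam):
    """Closed-form KL-regularized robust expectation of the next-step value:
    inf_P { E_P[v] + lam * KL(P || P0) } = -lam * log E_{P0}[exp(-v / lam)].
    p0: (S, A, S) nominal transition kernel; v_next: (S,) value; returns (S, A)."""
    w = -v_next / lam
    m = w.max()  # log-sum-exp shift for numerical stability
    return -lam * (m + np.log(p0 @ np.exp(w - m)))

def robust_pessimistic_vi(r, p0, horizon, lam, beta, counts):
    """Backward induction with the KL-regularized robust backup and a
    hypothetical count-based pessimism penalty beta / sqrt(n(s, a))."""
    v = np.zeros(r.shape[0])
    for _ in range(horizon):
        penalty = beta / np.sqrt(np.maximum(counts, 1.0))
        q = r + kl_robust_backup(v, p0, lam) - penalty
        v = np.clip(q.max(axis=1), 0.0, None)  # truncate below at 0, as in pessimistic VI
    return v
```

By Jensen's inequality the KL-robust backup never exceeds the nominal expectation E_{P0}[V], and it increases toward it as λ grows, so larger λ (weaker robustness regularization) yields pointwise larger values.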
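The experiment setup quotes a uniformly random behavior policy collecting an offline dataset of 100 samples from the nominal environment. A minimal sketch of such a data-collection loop in a tabular nominal MDP follows; the function name, shapes, and tuple layout are illustrative, not taken from the paper's code.

```python
import numpy as np

def collect_offline_data(p, r, horizon, n_traj, rng):
    """Roll out a uniformly random behavior policy in the nominal MDP and
    return a list of (s, a, reward, s_next) transition tuples.
    p: (S, A, S) nominal transition kernel; r: (S, A) reward table."""
    n_states, n_actions, _ = p.shape
    data = []
    for _ in range(n_traj):
        s = int(rng.integers(n_states))          # uniform initial state (assumption)
        for _ in range(horizon):
            a = int(rng.integers(n_actions))     # uniform behavior policy
            s_next = int(rng.choice(n_states, p=p[s, a]))
            data.append((s, a, r[s, a], s_next))
            s = s_next
    return data
```

With horizon H = 3, collecting about 33 trajectories yields roughly the 100 offline transitions quoted above.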