Robust Offline Reinforcement Learning with Linearly Structured $f$-Divergence Regularization

Authors: Cheng Tang, Zhishuai Liu, Pan Xu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, numerical experiments validate that R2PVI learns robust policies and exhibits superior computational efficiency compared to baseline methods. We conduct numerical experiments to explore (1) the robustness of R2PVI regarding dynamics shifts, (2) how the regularizer λ affects the robustness of R2PVI, and (3) the computation cost of R2PVI. We evaluate our algorithm in two off-dynamics problems. All experiment results are shown in Figure 2."
Researcher Affiliation | Academia | "1University of Illinois Urbana-Champaign (work was done while Cheng Tang was at Tsinghua University); 2Duke University. Correspondence to: Pan Xu <EMAIL>."
Pseudocode | Yes | "We propose the meta-algorithm in Algorithm 1 (R2PVI under general f-divergence). Next, we instantiate the f-divergence with TV, KL, and χ2 divergences respectively, and specify the estimation procedure corresponding to the different divergences (Algorithm 2: R2PVI under TV, KL, and χ2 divergence)."
Open Source Code | Yes | "The implementation of our R2PVI algorithm is available at https://github.com/panxulab/Robust-Regularized-Pessimistic-Value-Iteration."
Open Datasets | No | "We conduct experiments in simulated environments, including a linear MDP setting (Liu & Xu, 2024a) and the American Put Option environment (Tamar et al., 2014). We borrow the simulated linear MDP constructed in Liu & Xu (2024a) and adapt it to the offline setting. In this section, we test our algorithm in a simulated American Put Option environment (Tamar et al., 2014; Zhou et al., 2021) that does not belong to the d-rectangular linear RRMDP. We collect the offline data from the nominal environment by a uniformly random behavior policy."
Dataset Splits | No | "The sample size of the offline dataset is set to 100. We collect the offline data from the nominal environment by a uniformly random behavior policy." This text describes the collection and size of the offline data but does not specify any training, validation, or test splits.
Hardware Specification | Yes | "All experiments are conducted on a machine with an 11th Gen Intel(R) Core(TM) i5-11300H @ 3.10GHz processor, featuring 8 logical CPUs, 4 physical cores, and 2 threads per core."
Software Dependencies | No | The paper does not state version numbers for the software dependencies or libraries used in the implementation (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | "We set the behavior policy πb such that it chooses actions uniformly at random. The sample size of the offline dataset is set to 100. For completeness, we present more details on the experiment setup and results in Appendix A. The hyper-parameters in our setting are shown in Table 2. The horizon is 3; β, γ, and δ are set the same in all tasks; ξ1 is set to 0.3, 0.2, and 0.1 in Figure 1 to illustrate the versatility of our algorithms." (Table 2: H (horizon) = 3; β (pessimism parameter) = 1; γ = 0.1; δ = 0.3; ξ1 ∈ {0.3, 0.2, 0.1}.) "We set p0 = 0.5 in the nominal environment. β = 0.1 and γ = 1 are the hyper-parameters used in all tasks."
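The pseudocode rows above refer to R2PVI instantiated under TV, KL, and χ2 divergences. As an illustration only, and not the authors' implementation, the KL case admits a well-known closed form, inf_P {E_P[V] + λ·KL(P‖P0)} = −λ·log E_{P0}[exp(−V/λ)], which makes the robust regularized backup easy to sketch in a tabular finite-horizon setting. The pessimism penalty β/√n(s,a) below is a hypothetical count-based choice standing in for the paper's bonus; all names and shapes are illustrative.

```python
import numpy as np

def kl_robust_backup(v_next, p0, lam):
    """Closed-form KL-regularized robust expectation of the next-step value:
    inf_P { E_P[v] + lam * KL(P || P0) } = -lam * log E_{P0}[exp(-v / lam)].
    p0: (S, A, S) nominal transition kernel; v_next: (S,) value; returns (S, A)."""
    w = -v_next / lam
    m = w.max()  # log-sum-exp shift for numerical stability
    return -lam * (m + np.log(p0 @ np.exp(w - m)))

def robust_pessimistic_vi(r, p0, horizon, lam, beta, counts):
    """Backward induction with the KL-regularized robust backup and a
    hypothetical count-based pessimism penalty beta / sqrt(n(s, a))."""
    v = np.zeros(r.shape[0])
    for _ in range(horizon):
        penalty = beta / np.sqrt(np.maximum(counts, 1.0))
        q = r + kl_robust_backup(v, p0, lam) - penalty
        v = np.clip(q.max(axis=1), 0.0, None)  # truncate below at 0, as in pessimistic VI
    return v
```

By Jensen's inequality the KL-robust backup never exceeds the nominal expectation E_{P0}[V], and it increases toward it as λ grows, so larger λ (weaker robustness regularization) yields pointwise larger values.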
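The experiment setup quotes a uniformly random behavior policy collecting an offline dataset of 100 samples from the nominal environment. A minimal sketch of such a data-collection loop in a tabular nominal MDP follows; the function name, shapes, and tuple layout are illustrative, not taken from the paper's code.

```python
import numpy as np

def collect_offline_data(p, r, horizon, n_traj, rng):
    """Roll out a uniformly random behavior policy in the nominal MDP and
    return a list of (s, a, reward, s_next) transition tuples.
    p: (S, A, S) nominal transition kernel; r: (S, A) reward table."""
    n_states, n_actions, _ = p.shape
    data = []
    for _ in range(n_traj):
        s = int(rng.integers(n_states))          # uniform initial state (assumption)
        for _ in range(horizon):
            a = int(rng.integers(n_actions))     # uniform behavior policy
            s_next = int(rng.choice(n_states, p=p[s, a]))
            data.append((s, a, r[s, a], s_next))
            s = s_next
    return data
```

With horizon H = 3, collecting about 33 trajectories yields roughly the 100 offline transitions quoted above.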