Model-Free Offline Reinforcement Learning with Enhanced Robustness
Authors: Chi Zhang, Zain Ulabedeen Farhat, George Atia, Yue Wang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments further demonstrate that our approach significantly improves robustness in a more scalable manner than existing methods. We conduct extensive numerical experiments to demonstrate the improvements in robustness achieved by our algorithms in both simulated environments (Archibald et al., 1995) and real physics-based Classic Control problems (Brockman et al., 2016). In each case, our algorithm consistently outperforms existing methods in handling model uncertainty, showcasing its enhanced ability to maintain stable performance across a wide range of environmental perturbations. |
| Researcher Affiliation | Academia | Chi Zhang1, Zain Ulabedeen Farhat1, George K. Atia1,2, Yue Wang1,2 1 Department of Electrical and Computer Engineering 2 Department of Computer Science University of Central Florida Orlando, FL 32816, USA EMAIL |
| Pseudocode | Yes | Algorithm 1 Double-Pessimism Q-Learning for finite-horizon RMDPs. ... Algorithm 2 Double-Pessimism Q-Learning for infinite-horizon RMDPs with χ2-divergence uncertainty set. ... Algorithm 3 Double-Pessimism Q-Learning for infinite-horizon RMDPs. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We first evaluate the performance of our algorithm on the Garnet problem (Archibald et al., 1995), a randomly generated MDP G(a, b, c) with a states, b actions, and c branches (see Appendix A for a more detailed description). ... To further demonstrate the improvements in both scalability and robustness offered by our approach, we consider more complex Classic Control tasks from Open AI Gym (Brockman et al., 2016), specifically Mountain Car and Cart Pole (results are shown in Figure 4 in Appendix). |
| Dataset Splits | No | The paper describes how datasets are generated (e.g., '10 datasets are generated at each dataset size from T = 1000 to T = 20000') and how policies are evaluated in perturbed environments, but it does not specify explicit training/test/validation splits for dataset evaluation in the traditional sense. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Open AI Gym (Brockman et al., 2016)', 'Conservative Q-learning (CQL, (Kumar et al., 2020)) and Implicit Q-learning (IQL, (Kostrikov et al., 2021))', but does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We set γ = 0.95, Cb = 1 × 10⁻⁴ and δ = 0.02. ... The uncertainty set is constructed using the lα-norm, with the radius Rs,a ∈ [0.1, 0.5]. ... The randomness (i.e., optimality) of the behavior policy is controlled via temperature parameter tb = 1. State-action pairs with probabilities Ps,a ≥ 0.03 (for G(20, 30, 20)), Ps,a ≥ 0.02 (for G(30, 50, 30)) and Ps,a ≥ 0.01 (for G(50, 100, 50)) are then excluded to achieve partial coverage. ... After a policy is learned, we test its performance under a perturbed environment with the parameter randomly generated from [-τ, τ] for 800 times. |
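The Garnet benchmark quoted above, G(a, b, c), is a randomly generated MDP with a states, b actions, and c successor "branches" per state-action pair. The paper does not release code, so the following is only a minimal sketch of one standard Garnet construction; the function name, seeding, and use of a Dirichlet distribution over branches are assumptions, not the authors' implementation.

```python
import numpy as np

def garnet(a, b, c, seed=0):
    """Sketch of a Garnet problem G(a, b, c): a states, b actions,
    and c branches (nonzero next-state transitions per state-action
    pair). Illustrative only; not the paper's code."""
    rng = np.random.default_rng(seed)
    P = np.zeros((a, b, a))           # transition kernel P[s, u, s']
    R = rng.uniform(size=(a, b))      # random rewards in [0, 1)
    for s in range(a):
        for u in range(b):
            # pick c distinct successor states and a random
            # probability vector over them
            branches = rng.choice(a, size=c, replace=False)
            P[s, u, branches] = rng.dirichlet(np.ones(c))
    return P, R
```

For example, the experiments reference G(20, 30, 20), G(30, 50, 30), and G(50, 100, 50); each call above returns a valid stochastic kernel whose rows sum to one with at most c nonzero entries.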
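The setup states that the randomness (optimality) of the behavior policy is controlled via a temperature parameter tb = 1. The paper does not give the exact functional form; a softmax over an optimal Q-table is one common choice, sketched below purely as an assumption (both the helper name and the softmax form are hypothetical).

```python
import numpy as np

def behavior_policy(Q_star, tb=1.0):
    """Hypothetical temperature-controlled behavior policy:
    softmax over a Q-table, sharper as tb -> 0. The paper does
    not specify this form; it is an illustrative assumption."""
    logits = Q_star / tb
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)          # pi_b[s, a]
```

Under this construction, state-action pairs visited with probability above the quoted thresholds (e.g. Ps,a ≥ 0.03 for G(20, 30, 20)) would then be excluded from the offline dataset to induce partial coverage.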
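The evaluation protocol in the setup row is simple to restate as code: after a policy is learned, it is rolled out in a perturbed environment whose perturbation parameter is drawn uniformly from [-τ, τ], repeated 800 times. The sketch below assumes a `policy_return` callable standing in for a full environment rollout; that abstraction and the function name are illustrative, not from the paper.

```python
import numpy as np

def evaluate_robustness(policy_return, tau, n_trials=800, seed=0):
    """Sketch of the quoted evaluation protocol: draw a perturbation
    parameter uniformly from [-tau, tau] for each of 800 trials and
    record the learned policy's return. `policy_return(p)` is a
    stand-in for rolling out the policy under perturbation p."""
    rng = np.random.default_rng(seed)
    perturbations = rng.uniform(-tau, tau, size=n_trials)
    returns = np.array([policy_return(p) for p in perturbations])
    return returns.mean(), returns.std()
```

A robust policy would show a high mean and low spread across this perturbation range, which is the comparison the paper's robustness figures report.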