Model-based Offline Reinforcement Learning with Lower Expectile Q-Learning

Authors: Kwanyoung Park, Youngwoon Lee

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results show that LEQ significantly outperforms previous model-based offline RL methods on long-horizon tasks, such as the D4RL AntMaze tasks, matching or surpassing the performance of model-free approaches and sequence modeling approaches. Furthermore, LEQ matches the performance of state-of-the-art model-based and model-free methods in dense-reward environments across both state-based tasks (NeoRL and D4RL) and pixel-based tasks (V-D4RL), showing that LEQ works robustly across diverse domains. Our ablation studies demonstrate that lower expectile regression, λ-returns, and critic training on offline data are all crucial for LEQ.
Researcher Affiliation | Academia | Kwanyoung Park, Youngwoon Lee (Yonsei University)
Pseudocode | Yes | Algorithm 1: LEQ, Lower Expectile Q-learning with λ-returns; Algorithm 2: FQE, Fitted Q Evaluation (Le et al., 2019)
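The lower expectile regression named in Algorithm 1 reduces to an asymmetric squared loss on the TD error: with expectile level τ < 0.5, positive errors (overestimates of the target) are penalized more than negative ones, which pushes the Q-estimate toward a conservative, lower expectile of the return distribution. The sketch below is illustrative only and does not reproduce the paper's code; the function name and scalar interface are ours.

```python
def expectile_loss(diff, tau):
    """Asymmetric squared loss L_tau(u) = |tau - 1(u < 0)| * u^2.

    diff: TD error, target - Q(s, a).
    tau:  expectile level; tau = 0.5 recovers ordinary MSE, while
          tau < 0.5 down-weights positive errors so the learned Q
          settles at a lower (conservative) expectile of the target.
    """
    weight = tau if diff >= 0.0 else (1.0 - tau)
    return weight * diff ** 2

# With tau = 0.1, an overestimated target (diff > 0) is penalized
# 9x less than an underestimated one of the same magnitude.
overshoot = expectile_loss(2.0, 0.1)    # 0.1 * 4 = 0.4
undershoot = expectile_loss(-2.0, 0.1)  # 0.9 * 4 = 3.6
```

Averaging this loss over imagined model rollouts (in place of the symmetric squared error of standard Q-learning) is what makes the value estimate pessimistic without an explicit uncertainty penalty.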
Open Source Code | Yes | To ensure the reproducibility of our work, we provide the full code of LEQ in the supplementary materials, along with instructions to replicate the experiments presented in the paper. We provide the experimental details in Appendix A and the proof of the derivation of our surrogate policy objective in Appendix B.
Open Datasets | Yes | The experiments on the D4RL AntMaze, MuJoCo Gym (Fu et al., 2020), NeoRL (Qin et al., 2022), and V-D4RL (Lu et al., 2023) benchmarks show that LEQ improves model-based offline RL across diverse domains.
Dataset Splits | Yes | To verify the strength of our low-bias model-based conservative value estimation in diverse domains, we test LEQ on four benchmarks: D4RL AntMaze, D4RL MuJoCo Gym (Fu et al., 2020), NeoRL (Qin et al., 2022), and V-D4RL (Lu et al., 2023). We first test on long-horizon AntMaze tasks: umaze, medium, large from D4RL, and ultra from Jiang et al. (2023), as shown in Figure 3. We also evaluate LEQ on locomotion tasks (Figure 4): state-based tasks from D4RL and NeoRL, and pixel-based tasks from V-D4RL.
Hardware Specification | Yes | All experiments are done on a single RTX 4090 GPU and 8 AMD EPYC 9354 CPU cores, supported by Advanced Database System Infrastructure (NFEC-2024-11-300458).
Software Dependencies | No | LEQ follows MOBILE for most implementation details but is implemented in JAX (Bradbury et al., 2018), which makes it 6 times faster than the PyTorch versions of MOBILE and CBOP. The paper mentions JAX but does not specify a version number, nor does it list other software dependencies with specific version numbers.
Experiment Setup | Yes | Table 5: Shared hyperparameters of LEQ in state-based experiments. Table 6: Task-specific hyperparameter τ of LEQ in state-based experiments.
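The λ-returns that the ablation studies identify as crucial are the standard TD(λ) targets: a geometric mixture of n-step returns computed by a backward recursion over a (model-generated) rollout, G_t = r_t + γ[(1 − λ)V(s_{t+1}) + λG_{t+1}]. The sketch below illustrates only this recursion, under our own naming (`rewards`, a bootstrapped `values` array of length T+1, trace parameter `lam`, discount `gamma`); it is not the paper's implementation.

```python
def lambda_returns(rewards, values, lam, gamma):
    """Compute TD(lambda) targets G_t for a length-T rollout.

    rewards: [r_0, ..., r_{T-1}]
    values:  [V(s_0), ..., V(s_T)]  (one extra entry to bootstrap)
    Recursion (backward in time):
        G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}),
    with G_T = V(s_T). lam = 0 gives one-step TD targets; lam = 1
    gives full Monte Carlo returns bootstrapped at the horizon.
    """
    G = values[-1]
    out = []
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * ((1.0 - lam) * values[t + 1] + lam * G)
        out.append(G)
    return out[::-1]
```

In LEQ these targets are the quantities fed into the lower expectile regression, so the conservatism applies to multi-step, lower-variance estimates rather than to single-step bootstrapped targets.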