Model-based Offline Reinforcement Learning with Lower Expectile Q-Learning

Authors: Kwanyoung Park, Youngwoon Lee

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results show that LEQ significantly outperforms previous model-based offline RL methods on long-horizon tasks, such as the D4RL AntMaze tasks, matching or surpassing the performance of model-free approaches and sequence modeling approaches. Furthermore, LEQ matches the performance of state-of-the-art model-based and model-free methods in dense-reward environments across both state-based tasks (NeoRL and D4RL) and pixel-based tasks (V-D4RL), showing that LEQ works robustly across diverse domains. Our ablation studies demonstrate that lower expectile regression, λ-returns, and critic training on offline data are all crucial for LEQ.
Researcher Affiliation | Academia | Kwanyoung Park, Youngwoon Lee (Yonsei University)
Pseudocode | Yes | Algorithm 1: LEQ, Lower Expectile Q-learning with λ-returns; Algorithm 2: FQE, Fitted Q Evaluation (Le et al., 2019)
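The lower expectile regression named in Algorithm 1 reduces to an asymmetric squared loss on the TD error: with expectile level τ < 0.5, positive errors (overestimates of the target) are penalized more than negative ones, which pushes the Q-estimate toward a conservative, lower expectile of the return distribution. The sketch below is illustrative only and does not reproduce the paper's code; the function name and scalar interface are ours.

```python
def expectile_loss(diff, tau):
    """Asymmetric squared loss L_tau(u) = |tau - 1(u < 0)| * u^2.

    diff: TD error, target - Q(s, a).
    tau:  expectile level; tau = 0.5 recovers ordinary MSE, while
          tau < 0.5 down-weights positive errors so the learned Q
          settles at a lower (conservative) expectile of the target.
    """
    weight = tau if diff >= 0.0 else (1.0 - tau)
    return weight * diff ** 2

# With tau = 0.1, an overestimated target (diff > 0) is penalized
# 9x less than an underestimated one of the same magnitude.
overshoot = expectile_loss(2.0, 0.1)    # 0.1 * 4 = 0.4
undershoot = expectile_loss(-2.0, 0.1)  # 0.9 * 4 = 3.6
```

Averaging this loss over imagined model rollouts (in place of the symmetric squared error of standard Q-learning) is what makes the value estimate pessimistic without an explicit uncertainty penalty.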
Open Source Code | Yes | To ensure the reproducibility of our work, we provide the full code of LEQ in the supplementary materials, along with instructions to replicate the experiments presented in the paper. We provide the experimental details in Appendix A and the proof of the derivation of our surrogate policy objective in Appendix B.
Open Datasets | Yes | The experiments on the D4RL AntMaze, MuJoCo Gym (Fu et al., 2020), NeoRL (Qin et al., 2022), and V-D4RL (Lu et al., 2023) benchmarks show that LEQ improves model-based offline RL across diverse domains.
Dataset Splits | Yes | To verify the strength of our low-bias model-based conservative value estimation in diverse domains, we test LEQ on four benchmarks: D4RL AntMaze, D4RL MuJoCo Gym (Fu et al., 2020), NeoRL (Qin et al., 2022), and V-D4RL (Lu et al., 2023). We first test on long-horizon AntMaze tasks: umaze, medium, large from D4RL, and ultra from Jiang et al. (2023), as shown in Figure 3. We also evaluate LEQ on locomotion tasks (Figure 4): state-based tasks from D4RL and NeoRL, and pixel-based tasks from V-D4RL.
Hardware Specification | Yes | All experiments are done on a single RTX 4090 GPU and 8 AMD EPYC 9354 CPU cores, supported by Advanced Database System Infrastructure (NFEC-2024-11-300458).
Software Dependencies | No | LEQ follows MOBILE for most implementation details but is implemented in JAX (Bradbury et al., 2018), which makes it 6 times faster than the PyTorch versions of MOBILE and CBOP. The paper mentions JAX but does not specify a version number, nor does it list other software dependencies with specific version numbers.
Experiment Setup | Yes | Table 5: Shared hyperparameters of LEQ in state-based experiments. Table 6: Task-specific hyperparameter τ of LEQ in state-based experiments.
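The λ-returns that the ablation studies identify as crucial are the standard TD(λ) targets: a geometric mixture of n-step returns computed by a backward recursion over a (model-generated) rollout, G_t = r_t + γ[(1 − λ)V(s_{t+1}) + λG_{t+1}]. The sketch below illustrates only this recursion, under our own naming (`rewards`, a bootstrapped `values` array of length T+1, trace parameter `lam`, discount `gamma`); it is not the paper's implementation.

```python
def lambda_returns(rewards, values, lam, gamma):
    """Compute TD(lambda) targets G_t for a length-T rollout.

    rewards: [r_0, ..., r_{T-1}]
    values:  [V(s_0), ..., V(s_T)]  (one extra entry to bootstrap)
    Recursion (backward in time):
        G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}),
    with G_T = V(s_T). lam = 0 gives one-step TD targets; lam = 1
    gives full Monte Carlo returns bootstrapped at the horizon.
    """
    G = values[-1]
    out = []
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * ((1.0 - lam) * values[t + 1] + lam * G)
        out.append(G)
    return out[::-1]
```

In LEQ these targets are the quantities fed into the lower expectile regression, so the conservatism applies to multi-step, lower-variance estimates rather than to single-step bootstrapped targets.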