ACTIVE: Offline Reinforcement Learning via Adaptive Imitation and In-sample $V$-Ensemble

Authors: Tianyuan Chen, Ronglong Cai, Faguo Wu, Xiao Zhang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments on the D4RL benchmarks demonstrate that ACTIVE alleviates overfitting of value functions and outperforms existing in-sample methods in terms of learning stability and policy optimality. In this section, we present empirical evaluations of ACTIVE against baseline algorithms. We first compare our method with baseline model-free offline RL methods on the D4RL benchmark. We then analyze the effects of the in-sample V-ensemble (IVE) and the adaptive cloning temperature (ACT) in ablation studies.
Researcher Affiliation Academia 1School of Artificial Intelligence, Beihang University; 2School of Mathematical Sciences, Beihang University; 3Key Laboratory of Mathematics, Informatics and Behavioral Semantics, MoE, Beihang University; 4Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University; 5Hangzhou International Innovation Institute of Beihang University; 6Zhongguancun Laboratory. Emails: EMAIL.
Pseudocode Yes 6 ALGORITHM SUMMARY
Algorithm 1 ACTIVE
Hyperparameters: f = f_α, ensemble size m, target likelihood H_D, learning rates λ, λ_β, EMA rate η.
Initialize: ϕ, θ, θ̂, {ψ_i}_{i=1}^m, β, D.
for each gradient step do
    ψ_i ← ψ_i − λ∇_{ψ_i} L^f_V(ψ_i)  (Equation (6))
    θ ← θ − λ∇_θ L_Q(θ)  (Equation (7))
    θ̂ ← (1 − η)θ̂ + ηθ
    β ← β − λ_β ∇_β L_β  (Equation (13))
    ϕ ← ϕ − λ∇_ϕ L_π(ϕ)  (Equation (14))
end for
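The structure of Algorithm 1 can be sketched as a plain Python loop. This is only an illustration of the update order (ensemble value updates, Q update, EMA target, temperature update, policy update); the paper's actual losses L_V, L_Q, L_β, L_π (Equations (6), (7), (13), (14)) are replaced here by toy quadratic stand-ins, and all variable names are ours:

```python
import numpy as np

# Toy sketch of ACTIVE's update loop (Algorithm 1). The real losses are
# replaced by quadratic stand-ins, so gradients are 2*(x - target).
rng = np.random.default_rng(0)
m = 5                                  # ensemble size
lam, lam_beta, eta = 3e-4, 1e-3, 0.005 # learning rates and EMA rate

psi = [rng.normal() for _ in range(m)] # V-ensemble parameters {psi_i}
theta, theta_hat = 1.0, 1.0            # Q parameters and EMA target
beta, phi = 1.0, 0.0                   # cloning temperature, policy params

def grad_quadratic(x, target=0.0):
    """Stand-in for the gradient of the paper's losses."""
    return 2.0 * (x - target)

for step in range(1000):
    psi = [p - lam * grad_quadratic(p) for p in psi]    # Eq. (6), per member
    theta = theta - lam * grad_quadratic(theta)         # Eq. (7)
    theta_hat = (1 - eta) * theta_hat + eta * theta     # EMA target update
    beta = beta - lam_beta * grad_quadratic(beta)       # Eq. (13)
    phi = phi - lam * grad_quadratic(phi, target=beta)  # Eq. (14)
```

Note that the EMA target θ̂ lags behind θ by design, which is the usual Polyak-averaging stabilization for the critic target.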
Open Source Code No The paper does not provide an explicit statement of code release or a link to a code repository.
Open Datasets Yes Our experiments on the D4RL benchmarks demonstrate that ACTIVE alleviates overfitting of value functions and outperforms existing in-sample methods in terms of learning stability and policy optimality. We train SQL agents on the D4RL (Fu et al., 2020) antmaze-umaze-d-v2 dataset for 1M updates. Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for Deep Data-Driven Reinforcement Learning. arXiv preprint arXiv:2004.07219, 2020.
Dataset Splits No The paper mentions using D4RL datasets and evaluating policies but does not explicitly provide specific training/validation/test dataset splits, percentages, or sample counts.
Hardware Specification Yes Hardware. We use the following hardware: NVIDIA RTX 3090; Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz.
Software Dependencies Yes Software. We use the following software versions: D4RL 1.1 (Fu et al., 2020) (Apache-2.0 license); JAX 0.4.9 (Bradbury et al., 2018) (Apache-2.0 license); MuJoCo 2.1.0 (Todorov et al., 2012) (Apache-2.0 license); Gym 0.23.1 (Brockman et al., 2016) (MIT license).
Experiment Setup Yes General. We implement ACTIVE and reproduce IQL (Kostrikov et al., 2022) and SQL (Xu et al., 2023) based on the author-provided source code. We mainly tune the implicit regularization level (α or τ) along with ensemble size m and target likelihood H_D, while most remaining hyperparameters remain unchanged from the corresponding baseline (IQL for ACTIVE-I, SQL for ACTIVE-S).
Table 4: ACTIVE-I/S general hyperparameters.
Hyperparameter | Value
Actor learning rate | 3×10⁻⁴ (2×10⁻⁴ for AntMaze in ACTIVE-S)
Critic learning rate | 3×10⁻⁴ (2×10⁻⁴ for AntMaze in ACTIVE-S)
Value learning rate | 3×10⁻⁴ (2×10⁻⁴ for AntMaze in ACTIVE-S)
Batch size | 256
Optimizer | Adam
Network (all) | 3-layer ReLU-activated MLPs with 256 units
Discount γ | 0.99
Polyak η | 0.005
Layer normalization | Off
Value dropout | Off
Actor dropout | Off (p = 0.1 for Kitchen)
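For reference, the hyperparameters reported in Table 4 can be collected into a single configuration mapping. This is a sketch only; the key names are our own and purely illustrative, not the paper's implementation:

```python
# Hypothetical config mirroring Table 4 of the paper; key names are
# illustrative. Comments note the AntMaze/Kitchen exceptions from the table.
ACTIVE_HPARAMS = {
    "actor_lr": 3e-4,        # 2e-4 for AntMaze in ACTIVE-S
    "critic_lr": 3e-4,       # 2e-4 for AntMaze in ACTIVE-S
    "value_lr": 3e-4,        # 2e-4 for AntMaze in ACTIVE-S
    "batch_size": 256,
    "optimizer": "Adam",
    "hidden_layers": (256, 256, 256),  # 3-layer ReLU MLPs, 256 units each
    "discount": 0.99,        # gamma
    "polyak": 0.005,         # eta, EMA rate for the target network
    "layer_norm": False,
    "value_dropout": None,   # off
    "actor_dropout": None,   # off; p = 0.1 for Kitchen
}
```

A mapping like this makes the per-domain overrides (AntMaze learning rates, Kitchen actor dropout) easy to apply as a shallow dict update.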