ACTIVE: Offline Reinforcement Learning via Adaptive Imitation and In-sample $V$-Ensemble

Authors: Tianyuan Chen, Ronglong Cai, Faguo Wu, Xiao Zhang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments on the D4RL benchmarks demonstrate that ACTIVE alleviates overfitting of value functions and outperforms existing in-sample methods in terms of learning stability and policy optimality. In this section, we present empirical evaluations of ACTIVE against baseline algorithms. We first compare our method with baseline model-free offline RL methods on the D4RL benchmark. We then analyze the effects of the in-sample V-ensemble (IVE) and the adaptive cloning temperature (ACT) in ablation studies.
Researcher Affiliation Academia 1School of Artificial Intelligence, Beihang University; 2School of Mathematical Sciences, Beihang University; 3Key Laboratory of Mathematics, Informatics and Behavioral Semantics, MoE, Beihang University; 4Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University; 5Hangzhou International Innovation Institute of Beihang University; 6Zhongguancun Laboratory. Emails: EMAIL.
Pseudocode Yes 6 ALGORITHM SUMMARY
Algorithm 1 ACTIVE
Hyperparameters: f = f_α, ensemble size m, target likelihood H_D, learning rates λ, λ_β, EMA rate η.
Initialize: ϕ, θ, θ̂, {ψ_i}_{i=1}^m, β, D.
for each gradient step do
    ψ_i ← ψ_i − λ∇_{ψ_i} L^f_V(ψ_i)  (Equation (6))
    θ ← θ − λ∇_θ L_Q(θ)  (Equation (7))
    θ̂ ← (1 − η)θ̂ + ηθ
    β ← β − λ_β ∇_β L_β  (Equation (13))
    ϕ ← ϕ − λ∇_ϕ L_π(ϕ)  (Equation (14))
end for
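The structure of Algorithm 1 can be sketched as a plain Python loop. This is only an illustration of the update order (ensemble value updates, Q update, EMA target, temperature update, policy update); the paper's actual losses L_V, L_Q, L_β, L_π (Equations (6), (7), (13), (14)) are replaced here by toy quadratic stand-ins, and all variable names are ours:

```python
import numpy as np

# Toy sketch of ACTIVE's update loop (Algorithm 1). The real losses are
# replaced by quadratic stand-ins, so gradients are 2*(x - target).
rng = np.random.default_rng(0)
m = 5                                  # ensemble size
lam, lam_beta, eta = 3e-4, 1e-3, 0.005 # learning rates and EMA rate

psi = [rng.normal() for _ in range(m)] # V-ensemble parameters {psi_i}
theta, theta_hat = 1.0, 1.0            # Q parameters and EMA target
beta, phi = 1.0, 0.0                   # cloning temperature, policy params

def grad_quadratic(x, target=0.0):
    """Stand-in for the gradient of the paper's losses."""
    return 2.0 * (x - target)

for step in range(1000):
    psi = [p - lam * grad_quadratic(p) for p in psi]    # Eq. (6), per member
    theta = theta - lam * grad_quadratic(theta)         # Eq. (7)
    theta_hat = (1 - eta) * theta_hat + eta * theta     # EMA target update
    beta = beta - lam_beta * grad_quadratic(beta)       # Eq. (13)
    phi = phi - lam * grad_quadratic(phi, target=beta)  # Eq. (14)
```

Note that the EMA target θ̂ lags behind θ by design, which is the usual Polyak-averaging stabilization for the critic target.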
Open Source Code No The paper does not provide an explicit statement of code release or a link to a code repository.
Open Datasets Yes Our experiments on the D4RL benchmarks demonstrate that ACTIVE alleviates overfitting of value functions and outperforms existing in-sample methods in terms of learning stability and policy optimality. We train SQL agents on the D4RL (Fu et al., 2020) antmaze-umaze-d-v2 dataset for 1M updates. Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for Deep Data-Driven Reinforcement Learning. arXiv preprint arXiv:2004.07219, 2020.
Dataset Splits No The paper mentions using D4RL datasets and evaluating policies but does not explicitly provide specific training/validation/test dataset splits, percentages, or sample counts.
Hardware Specification Yes Hardware. We use the following hardware: NVIDIA RTX 3090; Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz.
Software Dependencies Yes Software. We use the following software versions: D4RL 1.1 (Fu et al., 2020) (Apache-2.0 license); JAX 0.4.9 (Bradbury et al., 2018) (Apache-2.0 license); MuJoCo 2.1.0 (Todorov et al., 2012) (Apache-2.0 license); Gym 0.23.1 (Brockman et al., 2016) (MIT license).
Experiment Setup Yes General. We implement ACTIVE and reproduce IQL (Kostrikov et al., 2022) and SQL (Xu et al., 2023) based on the author-provided source code. We mainly tune the implicit regularization level (α or τ) along with ensemble size m and target likelihood H_D, while most remaining hyperparameters remain unchanged from the corresponding baseline (IQL for ACTIVE-I, SQL for ACTIVE-S).
Table 4: ACTIVE-I/S general hyperparameters.
Hyperparameter | Value
Actor learning rate | 3×10⁻⁴ (2×10⁻⁴ for AntMaze in ACTIVE-S)
Critic learning rate | 3×10⁻⁴ (2×10⁻⁴ for AntMaze in ACTIVE-S)
Value learning rate | 3×10⁻⁴ (2×10⁻⁴ for AntMaze in ACTIVE-S)
Batch size | 256
Optimizer | Adam
Network (all) | 3-layer ReLU-activated MLPs with 256 units
Discount γ | 0.99
Polyak η | 0.005
Layer normalization | Off
Value dropout | Off
Actor dropout | Off (p = 0.1 for Kitchen)
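For reference, the hyperparameters reported in Table 4 can be collected into a single configuration mapping. This is a sketch only; the key names are our own and purely illustrative, not the paper's implementation:

```python
# Hypothetical config mirroring Table 4 of the paper; key names are
# illustrative. Comments note the AntMaze/Kitchen exceptions from the table.
ACTIVE_HPARAMS = {
    "actor_lr": 3e-4,        # 2e-4 for AntMaze in ACTIVE-S
    "critic_lr": 3e-4,       # 2e-4 for AntMaze in ACTIVE-S
    "value_lr": 3e-4,        # 2e-4 for AntMaze in ACTIVE-S
    "batch_size": 256,
    "optimizer": "Adam",
    "hidden_layers": (256, 256, 256),  # 3-layer ReLU MLPs, 256 units each
    "discount": 0.99,        # gamma
    "polyak": 0.005,         # eta, EMA rate for the target network
    "layer_norm": False,
    "value_dropout": None,   # off
    "actor_dropout": None,   # off; p = 0.1 for Kitchen
}
```

A mapping like this makes the per-domain overrides (AntMaze learning rates, Kitchen actor dropout) easy to apply as a shallow dict update.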