Online Pre-Training for Offline-to-Online Reinforcement Learning
Authors: Yongjae Shin, Jeonghye Kim, Whiyoung Jung, Sunghoon Hong, Deunsol Yoon, Youngsoo Jang, Geon-Hyeong Kim, Jongseong Chae, Youngchul Sung, Kanghoon Lee, Woohyung Lim
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Implementation of OPT on TD3 and SPOT demonstrates an average 30% improvement in performance across a wide range of D4RL environments, including MuJoCo, Antmaze, and Adroit. Table 1. Comparison of the normalized scores after online fine-tuning for each environment in the MuJoCo domain. Figure 3. Interquartile Mean (IQM) comparison of baseline methods on the MuJoCo and Antmaze domains. |
| Researcher Affiliation | Collaboration | Work done during an internship at LG AI Research. 1LG AI Research, Seoul, Republic of Korea. 2School of Electrical Engineering, KAIST, Daejeon, Republic of Korea. |
| Pseudocode | Yes | Algorithm 1 OPT: Online Pre-Training for Offline-to-Online Reinforcement Learning |
| Open Source Code | Yes | The complete training code is available at https://github.com/LGAI-Research/opt. |
| Open Datasets | Yes | We evaluate the performance of OPT across three domains from the D4RL benchmark (Fu et al. 2020). |
| Dataset Splits | No | Offline RL focuses on training agents using a static dataset D = {(s, a, r, s′)}, usually generated by various policies. In the proposed method, Q_on-pt is introduced as an additional value function specifically designed for online fine-tuning. Considering Q_on-pt, one straightforward approach is to add a randomly initialized value function. In this case, as Q_on-pt begins learning from the online fine-tuning, it is expected to adapt well to the new data encountered during online fine-tuning. However, since Q_on-pt is required to train from scratch, it often disrupts policy learning in the early stages. To address the potential negative effects that can arise from adding Q_on-pt with random initialization, we introduce a pre-training phase, termed Online Pre-Training, specifically designed to train Q_on-pt in advance of online fine-tuning. The following sections explore the design of the Online Pre-Training method in detail. Designing Datasets. As the initial stage of Online Pre-Training, the only available dataset to train Q_on-pt is the offline dataset B_off. |
| Hardware Specification | Yes | Comparison of wall-clock training time for TD3 and TD3 integrated with OPT on the walker2d-random-v2 environment using a single NVIDIA L40 GPU. |
| Software Dependencies | No | We implement OPT based on the codebase of each backbone algorithm. TD3 and RLPD are based on their official implementations, while SPOT and IQL are built upon the CORL (Tarasov et al. 2024) library. The Online Pre-Training phase is implemented by modifying the meta-adaptation method provided in the OEMA code, tailored specifically for value function learning. Additionally, balanced replay (Lee et al. 2022) is implemented using the authors' official implementation. |
| Experiment Setup | Yes | In our proposed Online Pre-Training, we set N_τ to 25k and N_pretrain to 50k for all environments. Additionally, for the MuJoCo domain, we use TD3 with a UTD ratio of 5 as the baseline. In the Adroit domain, we use SPOT as the baseline, trained with layer normalization (Ba 2016) applied to both the actor and critic networks. As mentioned in Section 3.2, we use the parameter κ to assign higher weight to Q_off-pt during the early stages of online fine-tuning, gradually shifting to give higher weight to Q_on-pt as training progresses. We control κ through linear scheduling. Table 15 outlines the κ scheduling for each environment, where κ_init represents the initial value of κ at the start of the online phase, T_decay specifies the number of timesteps over which κ increases, and κ_end indicates the final value of κ after the increase. |
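The κ scheduling quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact combination rule is not quoted above, so the convex mixture Q = (1 − κ)·Q_off-pt + κ·Q_on-pt is an assumption, with κ ramping linearly from κ_init to κ_end over T_decay timesteps as the row describes.

```python
def kappa_schedule(t, kappa_init, kappa_end, t_decay):
    """Linear ramp: kappa_init at t=0, kappa_end once t >= t_decay."""
    if t >= t_decay:
        return kappa_end
    return kappa_init + (t / t_decay) * (kappa_end - kappa_init)


def combined_q(q_off_pt, q_on_pt, kappa):
    # Assumed form: kappa weights the online-pretrained critic, so the
    # weight shifts from Q_off-pt toward Q_on-pt as kappa increases.
    return (1.0 - kappa) * q_off_pt + kappa * q_on_pt
```

With this schedule, the agent relies mostly on Q_off-pt right after the switch to online fine-tuning and mostly on Q_on-pt once T_decay steps have elapsed.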
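The Online Pre-Training idea quoted in the Dataset Splits row (train the fresh critic Q_on-pt for N_pretrain steps before fine-tuning, rather than starting it from random initialization during fine-tuning) can be illustrated with a toy TD(0) loop. Everything here is a hypothetical stand-in: a tabular value function replaces the critic network, and `offline_data` / `online_data` stand in for B_off and the N_τ newly collected transitions.

```python
import random


def online_pretrain(offline_data, online_data, n_pretrain=5_000,
                    lr=0.1, gamma=0.99, seed=0):
    """Toy sketch of the Online Pre-Training phase.

    A fresh value table (standing in for a randomly initialized Q_on-pt)
    is trained by TD(0) on a mixture of offline and newly collected
    online transitions, so it is no longer "from scratch" when online
    fine-tuning begins. Transitions are (state, reward, next_state).
    """
    rng = random.Random(seed)
    v = {}  # fresh estimates; unseen states default to 0.0
    data = offline_data + online_data
    for _ in range(n_pretrain):  # N_pretrain gradient steps in the paper
        s, r, s2 = rng.choice(data)
        target = r + gamma * v.get(s2, 0.0)
        v[s] = v.get(s, 0.0) + lr * (target - v.get(s, 0.0))
    return v
```

For a single transition ("s0", reward 1.0, terminal), repeated updates drive the estimate for "s0" toward 1.0, showing the pre-trained function converging on the data available before fine-tuning starts.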