Policy Regularization on Globally Accessible States in Cross-Dynamics Reinforcement Learning
Authors: Zhenghai Xue, Lang Feng, Jiacheng Xu, Kang Kang, Xiang Wen, Bo An, Shuicheng Yan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across multiple benchmarks demonstrate ASOR's effectiveness in enhancing state-of-the-art cross-domain policy transfer algorithms, significantly improving their performance. In this section, we conduct experiments to investigate the following questions: (1) Can ASOR efficiently learn from data with dynamics shift and outperform current state-of-the-art algorithms? (2) Is ASOR general enough when applied to different styles of training environments, various sources of environment dynamics shift, and when combined with distinct algorithm setups? (3) How does each component of ASOR (e.g., the reward augmentation and the pseudocount of state visitations) and its hyperparameters perform in practice? To answer questions (1) and (2), we construct cross-dynamics training environments based on tasks including Minigrid (Chevalier-Boisvert et al., 2023), D4RL (Fu et al., 2020), MuJoCo (Todorov et al., 2012), and a Fall Guys-like Battle Royale game. Dynamics shift in these environments comes from changes in navigation maps, evolution of environment parameters, and different layouts of obstacles. |
| Researcher Affiliation | Collaboration | Zhenghai Xue 1, Lang Feng 1, Jiacheng Xu 1 2, Kang Kang 2, Xiang Wen 2, Bo An 1 2, Shuicheng Yan 2 3. 1 Nanyang Technological University, Singapore; 2 Skywork AI; 3 National University of Singapore. |
| Pseudocode | Yes | Algorithm 1 The workflow of ASOR on top of ESCP (Luo et al., 2022). |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | To answer questions (1) and (2), we construct cross-dynamics training environments based on tasks including Minigrid (Chevalier-Boisvert et al., 2023), D4RL (Fu et al., 2020), MuJoCo (Todorov et al., 2012), and a Fall Guys-like Battle Royale game. Dynamics shift in these environments comes from changes in navigation maps, evolution of environment parameters, and different layouts of obstacles. |
| Dataset Splits | No | For offline RL benchmarks, we collect the static dataset from environments with three different environment dynamics in the format of D4RL (Fu et al., 2020). Specifically, data from the original MuJoCo environments, environments with 3 times larger body mass, and environments with 10 times higher medium density are included. For baseline algorithms, we include IfO algorithms BCO (Torabi et al., 2018a) and SOIL (Radosavovic et al., 2021), standard offline RL algorithms CQL (Kumar et al., 2020) and MOPO (Yu et al., 2020), and offline cross-domain policy transfer algorithms MAPLE (Chen et al., 2021), MAPLE+DARA (Liu et al., 2022), and MAPLE+SRPO (Xue et al., 2023a). |
| Hardware Specification | Yes | The training was conducted using NVIDIA TESLA V100 GPUs and takes around 20 hours to train 6M steps. |
| Software Dependencies | No | We utilized the Ray RLlib framework (Liang et al., 2018), configuring 100 training workers and 20 evaluation workers. |
| Experiment Setup | Yes | The batch size was set to 1024, with an initial learning rate of 1×10⁻³, which linearly decayed to 3×10⁻⁴ over 250 steps. An entropy regularization coefficient of 0.003 was employed to ensure adequate exploration during training. |
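The reported optimizer schedule (initial learning rate 1×10⁻³ linearly decayed to 3×10⁻⁴ over 250 steps) can be reproduced with a small helper. This is a generic sketch, not code from the paper; the function name `linear_lr` and the clamp-at-final behavior after step 250 are assumptions.

```python
def linear_lr(step, lr_init=1e-3, lr_final=3e-4, decay_steps=250):
    """Linearly interpolate the learning rate from lr_init to lr_final
    over decay_steps, then hold it constant at lr_final."""
    frac = min(step / decay_steps, 1.0)  # fraction of the decay completed
    return lr_init + frac * (lr_final - lr_init)
```

Such a schedule is typically passed to the optimizer per update step; RLlib, which the authors report using, also accepts piecewise-linear schedules natively via its config.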
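The ablation row mentions a pseudocount of state visitations feeding a reward augmentation. The paper's exact formulation is not quoted here, so the following is only a generic count-based bonus sketch: states are bucketed by a hashable key, and the bonus shrinks as the pseudocount grows. The names `make_count_bonus`, `beta`, and the 1/√n decay are illustrative assumptions, not the authors' method.

```python
from collections import defaultdict

def make_count_bonus(beta=0.1):
    """Return a closure that tracks per-state visitation pseudocounts
    and yields a bonus proportional to 1/sqrt(count)."""
    counts = defaultdict(int)

    def bonus(state_key):
        counts[state_key] += 1           # update the pseudocount
        return beta / counts[state_key] ** 0.5

    return bonus
```

In practice the bonus (or a penalty derived from it) would be added to the environment reward before the policy update, which is the usual way count-based terms enter an RL objective.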