Policy Regularization on Globally Accessible States in Cross-Dynamics Reinforcement Learning
Authors: Zhenghai Xue, Lang Feng, Jiacheng Xu, Kang Kang, Xiang Wen, Bo An, Shuicheng Yan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across multiple benchmarks demonstrate ASOR's effectiveness in enhancing state-of-the-art cross-domain policy transfer algorithms, significantly improving their performance. In this section, we conduct experiments to investigate the following questions: (1) Can ASOR efficiently learn from data with dynamics shift and outperform current state-of-the-art algorithms? (2) Is ASOR general enough when applied to different styles of training environments, various sources of environment dynamics shift, and when combined with distinct algorithm setups? (3) How does each component of ASOR (e.g., the reward augmentation and the pseudocount of state visitations) and its hyperparameters perform in practice? To answer questions (1) and (2), we construct cross-dynamics training environments based on tasks including Minigrid (Chevalier-Boisvert et al., 2023), D4RL (Fu et al., 2020), MuJoCo (Todorov et al., 2012), and a Fall Guys-like Battle Royale game. Dynamics shift in these environments comes from changes in navigation maps, evolution of environment parameters, and different layouts of obstacles. |
| Researcher Affiliation | Collaboration | Zhenghai Xue 1, Lang Feng 1, Jiacheng Xu 1 2, Kang Kang 2, Xiang Wen 2, Bo An 1 2, Shuicheng Yan 2 3. 1 Nanyang Technological University, Singapore; 2 Skywork AI; 3 National University of Singapore. |
| Pseudocode | Yes | Algorithm 1 The workflow of ASOR on top of ESCP (Luo et al., 2022). |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | To answer questions (1) and (2), we construct cross-dynamics training environments based on tasks including Minigrid (Chevalier-Boisvert et al., 2023), D4RL (Fu et al., 2020), MuJoCo (Todorov et al., 2012), and a Fall Guys-like Battle Royale game. Dynamics shift in these environments comes from changes in navigation maps, evolution of environment parameters, and different layouts of obstacles. |
| Dataset Splits | No | For offline RL benchmarks, we collect the static dataset from environments with three different environment dynamics in the format of D4RL (Fu et al., 2020). Specifically, data from the original MuJoCo environments, environments with 3 times larger body mass, and environments with 10 times higher medium density are included. For baseline algorithms, we include IfO algorithms BCO (Torabi et al., 2018a) and SOIL (Radosavovic et al., 2021), standard offline RL algorithms CQL (Kumar et al., 2020) and MOPO (Yu et al., 2020), and offline cross-domain policy transfer algorithms MAPLE (Chen et al., 2021), MAPLE+DARA (Liu et al., 2022), and MAPLE+SRPO (Xue et al., 2023a). |
| Hardware Specification | Yes | The training was conducted using NVIDIA TESLA V100 GPUs and takes around 20 hours to train 6M steps. |
| Software Dependencies | No | We utilized the Ray RLlib framework (Liang et al., 2018), configuring 100 training workers and 20 evaluation workers. |
| Experiment Setup | Yes | The batch size was set to 1024, with an initial learning rate of 1×10⁻³, which linearly decayed to 3×10⁻⁴ over 250 steps. An entropy regularization coefficient of 0.003 was employed to ensure adequate exploration during training. |
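The reported optimizer schedule (initial learning rate 1×10⁻³ linearly decayed to 3×10⁻⁴ over 250 steps) can be reproduced with a small helper. This is a generic sketch, not code from the paper; the function name `linear_lr` and the clamp-at-final behavior after step 250 are assumptions.

```python
def linear_lr(step, lr_init=1e-3, lr_final=3e-4, decay_steps=250):
    """Linearly interpolate the learning rate from lr_init to lr_final
    over decay_steps, then hold it constant at lr_final."""
    frac = min(step / decay_steps, 1.0)  # fraction of the decay completed
    return lr_init + frac * (lr_final - lr_init)
```

Such a schedule is typically passed to the optimizer per update step; RLlib, which the authors report using, also accepts piecewise-linear schedules natively via its config.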
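The ablation row mentions a pseudocount of state visitations feeding a reward augmentation. The paper's exact formulation is not quoted here, so the following is only a generic count-based bonus sketch: states are bucketed by a hashable key, and the bonus shrinks as the pseudocount grows. The names `make_count_bonus`, `beta`, and the 1/√n decay are illustrative assumptions, not the authors' method.

```python
from collections import defaultdict

def make_count_bonus(beta=0.1):
    """Return a closure that tracks per-state visitation pseudocounts
    and yields a bonus proportional to 1/sqrt(count)."""
    counts = defaultdict(int)

    def bonus(state_key):
        counts[state_key] += 1           # update the pseudocount
        return beta / counts[state_key] ** 0.5

    return bonus
```

In practice the bonus (or a penalty derived from it) would be added to the environment reward before the policy update, which is the usual way count-based terms enter an RL objective.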