Highway Graph to Accelerate Reinforcement Learning
Authors: Zidu Yin, Zhen Zhang, Dong Gong, Stefano V Albrecht, Javen Qinfeng Shi
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across four categories of environments demonstrate that our method learns significantly faster than established and state-of-the-art model-free and model-based RL algorithms (often by a factor of 10 to 150) while maintaining equal or superior expected returns. |
| Researcher Affiliation | Academia | Zidu Yin, School of Information Science and Technology, Yunnan Normal University; Zhen Zhang, School of Computer and Mathematical Sciences, Adelaide University; Dong Gong, School of Computer Science and Engineering, The University of New South Wales; Stefano V. Albrecht, School of Informatics, University of Edinburgh; Javen Q. Shi, School of Computer and Mathematical Sciences, Adelaide University |
| Pseudocode | Yes | Algorithm 1 Highway graph incremental construction Algorithm 2 Value updating on highway graph |
| Open Source Code | Yes | The implementation of our highway graph RL method is publicly available at https://github.com/coodest/highwayRL. |
| Open Datasets | Yes | Simple Maze: a simple maze environment with customizable sizes. Toy Text (Towers et al., 2023): a tiny and simple game set with small discrete state and action spaces, including Frozen Lake, Taxi, Cliff Walking, and Blackjack. Google Research Football (GRF) (Kurach et al., 2020): a physics-based football simulator. Atari Learning Environment (Bellemare et al., 2013): a simulator for Atari 2600 console games. |
| Dataset Splits | No | To better show the training-efficiency advantages of our highway graph RL method, we use only one million frames of interaction from the different types of environments. Whether the information from one million frames is enough to solve the tasks in these environments will also be shown. |
| Hardware Specification | Yes | All the experiments were running in the Docker container with identical system resources including 8 CPU cores with 128 GB RAM, and an NVIDIA RTX 3090Ti GPU with 24 GB VRAM. |
| Software Dependencies | No | In addition, we use RLlib (Liang et al., 2018) implementations for DQN, PPO, A3C, R2D2, and IMPALA. Other baselines, including NEC, MFEC, EMDQN, GEM, and Gumbel MuZero, are obtained from their official repositories. |
| Experiment Setup | Yes | For Simple Maze, Toy Text, and Google Research Football, we adopt a discount factor of 0.99; for Atari games, we adopt a discount factor of 1 - 1e-6, since the very long trajectories in Atari games are not suitable for recurrent random projectors. DQN: double DQN is used with dueling enabled. The n-step for Q-learning is 1. A Huber loss is computed for the TD error. A3C: the coefficients for the value-function term and the entropy-regularizer term in the loss function are 0.5 and 0.01, respectively. The gradient clip is set to 40. PPO: the initial coefficient and target value for the KL divergence are 0.5 and 0.01, respectively. The coefficient of the value-function loss is 1.0. |
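For reference, the baseline hyperparameters quoted in the Experiment Setup row can be collected as plain Python dicts. This is a minimal sketch: the key names below mimic RLlib's legacy config-dict style and are assumptions for illustration, not taken verbatim from the paper or its code.

```python
# Discount factors reported in the paper: 0.99 for Simple Maze, Toy Text,
# and GRF; a value very close to 1 for Atari, whose trajectories are long.
GAMMA_DEFAULT = 0.99
GAMMA_ATARI = 1 - 1e-6

# DQN baseline settings (key names are assumed, RLlib-legacy style).
DQN_CONFIG = {
    "double_q": True,  # double DQN
    "dueling": True,   # dueling architecture enabled
    "n_step": 1,       # 1-step Q-learning targets; TD error uses a Huber loss
}

# A3C baseline settings.
A3C_CONFIG = {
    "vf_loss_coeff": 0.5,   # value-function term coefficient in the loss
    "entropy_coeff": 0.01,  # entropy-regularizer coefficient
    "grad_clip": 40,        # gradient clipping threshold
}

# PPO baseline settings.
PPO_CONFIG = {
    "kl_coeff": 0.5,       # initial KL-divergence coefficient
    "kl_target": 0.01,     # target KL divergence
    "vf_loss_coeff": 1.0,  # value-function loss coefficient
}
```

Collecting the quoted values this way makes it easy to diff a local reproduction attempt against the paper's reported setup.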