Highway Graph to Accelerate Reinforcement Learning

Authors: Zidu Yin, Zhen Zhang, Dong Gong, Stefano V. Albrecht, Javen Qinfeng Shi

TMLR 2025

Reproducibility assessment (Variable: Result — LLM Response):
Research Type: Experimental — "Experiments across four categories of environments demonstrate that our method learns significantly faster than established and state-of-the-art model-free and model-based RL algorithms (often by a factor of 10 to 150) while maintaining equal or superior expected returns."
Researcher Affiliation: Academia — Zidu Yin (EMAIL), School of Information Science and Technology, Yunnan Normal University; Zhen Zhang (EMAIL), School of Computer and Mathematical Sciences, Adelaide University; Dong Gong (EMAIL), School of Computer Science and Engineering, The University of New South Wales; Stefano V. Albrecht (EMAIL), School of Informatics, University of Edinburgh; Javen Q. Shi (EMAIL), School of Computer and Mathematical Sciences, Adelaide University.
Pseudocode: Yes — Algorithm 1: Highway graph incremental construction; Algorithm 2: Value updating on highway graph.
Open Source Code: Yes — "The implementation of our highway graph RL method is publicly available at https://github.com/coodest/highwayRL."
Open Datasets: Yes — four environment suites are used:
- Simple Maze: a simple maze environment with customizable sizes.
- Toy Text (Towers et al., 2023): a tiny and simple game set with small discrete state and action spaces, including Frozen Lake, Taxi, Cliff Walking, and Blackjack.
- Google Research Football (GRF) (Kurach et al., 2020): a physics-based football simulator.
- Atari Learning Environment (Bellemare et al., 2013): a simulator for Atari 2600 console games.
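To make the "customizable sizes" idea concrete, here is a minimal grid-maze sketch in plain Python. This is purely illustrative and is not the paper's Simple Maze implementation; the class name, action encoding, and reward scheme (-1 per step, 0 at the goal) are all assumptions.

```python
# Hypothetical minimal maze environment (NOT the paper's Simple Maze code).
# The agent starts at (0, 0) and must reach (size-1, size-1).
class SimpleMaze:
    # action id -> (row delta, col delta): up, down, left, right
    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

    def __init__(self, size=5):
        self.size = size
        self.pos = (0, 0)

    def reset(self):
        """Return the agent to the start cell and return the initial state."""
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        """Apply an action, clipping moves at the walls; return (state, reward, done)."""
        dr, dc = self.ACTIONS[action]
        row = min(max(self.pos[0] + dr, 0), self.size - 1)
        col = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (row, col)
        done = self.pos == (self.size - 1, self.size - 1)
        reward = 0.0 if done else -1.0  # assumed step cost / goal reward
        return self.pos, reward, done
```

Changing the `size` argument is what "customizable sizes" would correspond to in such a sketch.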
Dataset Splits: No — "To better show the training-efficiency advantage of our highway graph RL method, we use only one million frames of interaction from each type of environment. Whether the information from one million frames is enough to solve the tasks in these environments will also be shown."
Hardware Specification: Yes — "All experiments were run in a Docker container with identical system resources: 8 CPU cores, 128 GB RAM, and an NVIDIA RTX 3090 Ti GPU with 24 GB VRAM."
Software Dependencies: No — "We use RLlib (Liang et al., 2018) implementations for DQN, PPO, A3C, R2D2, and IMPALA. Other baselines, including NEC, MFEC, EMDQN, GEM, and Gumbel MuZero, are obtained from their official repositories."
Experiment Setup: Yes — For Simple Maze, Toy Text, and Google Research Football we adopt a discount factor of 0.99; for Atari games we adopt a discount factor of 1 - 1e-6, since the very long trajectories in Atari games are not suitable for recurrent random projectors. Baseline hyperparameters:
- DQN: double DQN with dueling enabled; the n-step of Q-learning is 1; a Huber loss is computed for the TD error.
- A3C: the coefficients for the value-function term and the entropy-regularizer term in the loss function are 0.5 and 0.01, respectively; the gradient clip is set to 40.
- PPO: the initial coefficient and target value for the KL divergence are 0.5 and 0.01, respectively; the coefficient of the value-function loss is 1.0.
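The hyperparameters listed above can be gathered into a small configuration sketch. The dictionary and key names below are illustrative only (they are not RLlib's actual configuration keys); the values are taken directly from the setup described in this row.

```python
# Hyperparameters from the experiment setup, collected into plain dicts.
# Key names are illustrative assumptions, not RLlib configuration keys.
DISCOUNT = {
    "simple_maze": 0.99,
    "toy_text": 0.99,
    "grf": 0.99,
    "atari": 1 - 1e-6,  # near-1 discount for very long Atari trajectories
}

BASELINE_CONFIG = {
    "DQN": {"double_q": True, "dueling": True, "n_step": 1, "td_loss": "huber"},
    "A3C": {"vf_coeff": 0.5, "entropy_coeff": 0.01, "grad_clip": 40},
    "PPO": {"kl_coeff": 0.5, "kl_target": 0.01, "vf_loss_coeff": 1.0},
}
```

Note the Atari discount is written as `1 - 1e-6` rather than a hard-coded decimal, which keeps the intent (a discount just below 1) explicit.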