Highly Parallelized Reinforcement Learning Training with Relaxed Assignment Dependencies

Authors: Zhouyu He, Peng Qiao, Rongchun Li, Yong Dou, Yusong Tan

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted extensive experiments. TianJi achieves a convergence-time acceleration ratio of up to 4.37 compared with related systems. When scaled to eight computing nodes, TianJi shows a convergence-time speedup of 1.6 and a throughput speedup of 7.13 relative to XingTian, demonstrating both training acceleration and scalability. In data-transmission efficiency experiments, TianJi significantly outperforms the other systems, approaching hardware limits. TianJi is also effective for on-policy algorithms, achieving convergence-time acceleration ratios of 4.36 and 2.95 over RLlib and XingTian, respectively.
Researcher Affiliation | Academia | 1. College of Computer Science and Technology, National University of Defense Technology; 2. National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology.
Pseudocode | Yes | Algorithm 1: Pseudo-code for Function Approximation-based Temporal Difference(0).
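The algorithm named in this row, TD(0) with function approximation, is a standard policy-evaluation method. The sketch below is illustrative only, not the paper's Algorithm 1: it runs TD(0) with one-hot linear features (which reduces to tabular TD) on a small random-walk chain. All names and hyperparameters here are assumptions for the example.

```python
import random

random.seed(0)

# Tiny 5-state random-walk chain: states 0..4, with states 0 and 4 terminal.
# Reward is +1 on reaching state 4, else 0 (a standard TD(0) test problem).
N = 5

def td0(episodes=2000, alpha=0.1, gamma=1.0):
    """TD(0) with linear function approximation; one-hot features make
    v(s) = w[s], so the gradient of v w.r.t. w is the one-hot vector."""
    w = [0.0] * N                      # weight vector, v(s) = w . x(s)
    for _ in range(episodes):
        s = 2                          # start in the middle of the chain
        while 0 < s < N - 1:
            s2 = s + random.choice((-1, 1))
            r = 1.0 if s2 == N - 1 else 0.0
            v_s2 = 0.0 if s2 in (0, N - 1) else w[s2]  # terminals have value 0
            # TD(0) update: w <- w + alpha * (r + gamma*v(s') - v(s)) * grad v(s)
            w[s] += alpha * (r + gamma * v_s2 - w[s])
            s = s2
    return w

# True values for states 1..3 are 0.25, 0.5, 0.75.
print([round(v, 2) for v in td0()])
```

With one-hot features the update is exactly tabular TD(0); richer feature vectors would replace `w[s]` with a dot product and scale the update by the feature vector.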
Open Source Code | No | The paper does not provide a link to, or an explicit statement about releasing, the source code for its methodology. It mentions 'Ape-X's open-source implementation', but that refers to a baseline, not the authors' own work.
Open Datasets | Yes | Algorithms & Environment: experiments were conducted in Atari and OpenAI Gym. The algorithms evaluated are DQN (off-policy) and PPO (on-policy).
Dataset Splits | No | The paper mentions using 'RLlib's default network architecture and parameters as benchmarks for both Gym and Atari tasks' and states 'All comparisons used identical network architectures and hyperparameters across the same games to ensure fairness.' However, it does not detail how data from the Atari and OpenAI Gym environments were split into training, validation, or test sets (e.g., percentages, exact counts, or split methodology).
Hardware Specification | Yes | Testbed: two hardware platforms were configured. The first is a CPU-only Slurm cluster with 8 computing nodes; each node has 2 Intel Xeon Gold 6248 processors (40 physical cores per node) and 384 GB of memory, and the nodes are interconnected via ConnectX-6 high-speed links. The second is a heterogeneous machine with one A100 GPU, 40 physical cores, and 376 GB of memory.
Software Dependencies | No | The paper mentions software such as RLlib, Ray, and the gRPC library but gives no version numbers for any of these components, which are necessary for a reproducible description of ancillary software.
Experiment Setup | Yes | The optimal computational-resource mapping identified by the distributed strategy includes 2 learners and 4 actors using 16 cores, denoted L2A8-C16. A random computational-resource mapping, labeled L1A14-C16, was also evaluated. After introducing asynchrony, simulations with serial sample distributions were conducted at two sample ratios: 1:1 (denoted New) and 1:8 (denoted Staleness). Four sets of control experiments were conducted.
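The 1:1 (New) and 1:8 (Staleness) sample ratios above describe how fresh and stale samples are interleaved when sample distribution is serialized. A minimal way to realize such a ratio is sketched below; this is not the paper's implementation, and `mix_samples` and its arguments are hypothetical names for illustration.

```python
import itertools

def mix_samples(fresh, stale, ratio=(1, 8)):
    """Interleave samples so each `ratio[0]` fresh samples are followed by
    `ratio[1]` stale ones. ratio=(1, 1) corresponds to the 'New' setting,
    ratio=(1, 8) to the 'Staleness' setting described in the experiments."""
    n_new, n_old = ratio
    fresh_it, stale_it = iter(fresh), iter(stale)
    mixed = []
    while True:
        chunk = list(itertools.islice(fresh_it, n_new))
        if not chunk:                  # fresh stream exhausted: stop mixing
            break
        mixed.extend(chunk)
        mixed.extend(itertools.islice(stale_it, n_old))
    return mixed

# Toy 1:2 ratio: each fresh sample is followed by two stale ones.
print(mix_samples([0, 1, 2], list("abcdef"), ratio=(1, 2)))
# -> [0, 'a', 'b', 1, 'c', 'd', 2, 'e', 'f']
```

In a real asynchronous trainer the stale stream would come from a replay buffer rather than a list, but the ratio logic is the same.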