Highly Parallelized Reinforcement Learning Training with Relaxed Assignment Dependencies
Authors: Zhouyu He, Peng Qiao, Rongchun Li, Yong Dou, Yusong Tan
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted extensive experiments. TianJi achieves a convergence-time acceleration ratio of up to 4.37 over comparable systems. When scaled to eight computing nodes, TianJi shows a convergence-time speedup of 1.6 and a throughput speedup of 7.13 relative to XingTian, demonstrating both its ability to accelerate training and its scalability. In data-transmission efficiency experiments, TianJi significantly outperforms the other systems, approaching hardware limits. TianJi is also effective with on-policy algorithms, achieving convergence-time acceleration ratios of 4.36 and 2.95 over RLlib and XingTian, respectively. |
| Researcher Affiliation | Academia | 1College of Computer Science and Technology, National University of Defense Technology 2National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology EMAIL |
| Pseudocode | Yes | Algorithm 1: Pseudo-code for Function Approximation-based Temporal Difference(0) |
| Open Source Code | No | The paper does not provide a specific link or explicit statement about releasing the source code for the methodology described in this paper. It mentions "Ape-X's open-source implementation," but that refers to a baseline, not the authors' own work. |
| Open Datasets | Yes | Algorithms & Environment. We conducted experiments in Atari and OpenAI Gym. The algorithms evaluated are DQN (off-policy) and PPO (on-policy). |
| Dataset Splits | No | The paper mentions using "RLlib's default network architecture and parameters as benchmarks for both Gym and Atari tasks" and states "All comparisons used identical network architectures and hyperparameters across the same games to ensure fairness." However, it does not provide specific details on how the data from the Atari and OpenAI Gym environments were split into training, validation, or test sets (e.g., percentages, exact counts, or specific split methodologies). |
| Hardware Specification | Yes | Testbed. We configured two hardware platforms for our experiments. The first platform is a CPU-only Slurm cluster with 8 computing nodes. Each node is equipped with 2 Intel Xeon Gold 6248 processors, providing a total of 40 physical cores per node. Each node has 384GB of memory, and the nodes are interconnected using ConnectX-6 high-speed interconnects. The second platform is a heterogeneous machine equipped with one A100 GPU, 40 physical cores, and 376GB of memory. |
| Software Dependencies | No | The paper mentions software like 'RLlib', 'Ray', and 'gRPC library' but does not provide specific version numbers for any of these components, which are necessary for reproducible descriptions of ancillary software. |
| Experiment Setup | Yes | The optimal computational-resource mapping, identified by the distributed strategy, includes 2 learners and 4 actors, using 16 cores, denoted as L2A8-C16. A random computational-resource mapping, labeled L1A14-C16, was also evaluated. After introducing asynchrony, simulations with serial sample distributions were conducted using two sample ratios: 1:1 (denoted as New) and 1:8 (denoted as Staleness). Four sets of control experiments were conducted. |
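The pseudocode row above refers to the paper's Algorithm 1, function-approximation-based Temporal Difference(0). The paper's exact pseudocode is not reproduced here; the following is a minimal sketch of the standard semi-gradient TD(0) update with linear function approximation, on a toy random-walk chain (all names, the environment, and the hyperparameters are illustrative assumptions):

```python
import numpy as np

# Sketch of semi-gradient TD(0) with linear function approximation
# (standard update rule, not the paper's Algorithm 1 verbatim).
# Toy 5-state random-walk chain; one-hot features; illustrative only.

N_STATES = 5
GAMMA = 0.9
ALPHA = 0.1

def phi(s):
    """One-hot feature vector for state s."""
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x

def step(s, rng):
    """Random walk: move left or right; reward 1.0 on reaching the right end."""
    s_next = min(max(s + rng.choice([-1, 1]), 0), N_STATES - 1)
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    done = s_next in (0, N_STATES - 1)  # both ends are absorbing
    return s_next, r, done

def td0(episodes=2000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES)  # weights of the linear value function v(s) = w @ phi(s)
    for _ in range(episodes):
        s = N_STATES // 2  # start in the middle of the chain
        done = False
        while not done:
            s_next, r, done = step(s, rng)
            v_next = 0.0 if done else w @ phi(s_next)
            td_error = r + GAMMA * v_next - w @ phi(s)
            w += ALPHA * td_error * phi(s)  # semi-gradient TD(0) update
            s = s_next
    return w

if __name__ == "__main__":
    print(np.round(td0(), 2))
```

With one-hot features this reduces to tabular TD(0); the learned values increase toward the rewarding right end of the chain, which is the qualitative behavior one would check before scaling the same update rule to the distributed setting the paper studies.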