Efficient Exploration in Multi-Agent Reinforcement Learning via Farsighted Self-Direction

Authors: Tiancheng Lao, Xudong Guo, Mengge Liu, Junjie Yu, Yi Liu, Wenhui Fan

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the method on didactic examples and demonstrate the outperformance of our method on challenging StarCraft II micromanagement tasks. ... Empirical results show that FSD achieves state-of-the-art performance compared to several widely adopted baseline methods. Notably, unlike previous curiosity-driven methods, FSD does not use an explorer, thereby saving a considerable amount of computational resources. Additionally, we demonstrate the effectiveness of Q_i^int and clipped double Q-learning separately through ablation studies.
Researcher Affiliation | Academia | All six authors (Tiancheng Lao EMAIL, Xudong Guo EMAIL, Mengge Liu EMAIL, Junjie Yu EMAIL, Yi Liu EMAIL, Wenhui Fan EMAIL) are affiliated with the Department of Automation, Tsinghua University, Beijing, China.
Pseudocode | No | The paper describes the FSD method using mathematical equations and textual explanations, but it does not include any clearly labeled pseudocode blocks or algorithms.
Open Source Code | No | The paper states: 'Baselines are trained using their open-source codes, with some results derived from the open-source results of EXPODE (Zhang & Yu, 2023).' This refers to the code of other methods, not the authors' own implementation of FSD. There is no explicit statement or link provided for the FSD source code.
Open Datasets | Yes | We validate our method on didactic examples used in EMC (Zheng et al., 2021). ... Subsequently, we evaluated FSD on Predator Prey (Rashid et al., 2020) and several challenging StarCraft II micromanagement tasks (Samvelyan et al., 2019).
Dataset Splits | No | For Simultaneous Arrival, the paper notes: 'the standard setting of the scenario lacks randomness, meaning that during evaluation, the win rate is either 0 or 1 for a fixed policy, and the curves essentially represent the proportion of wins across five different runs.' For Predator Prey and SMAC, the paper refers to the benchmarks themselves but does not explicitly describe how the data was split into training, validation, or test sets for its experiments.
Hardware Specification | Yes | Experiments are conducted on eight NVIDIA RTX 4090s, with training time ranging from half an hour to 10 hours, depending on the complexity of the task and the number of agents involved.
Software Dependencies | No | The paper does not explicitly mention any specific software dependencies or library versions used for the implementation (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | In our experiments, we set p to a relatively high value of 3. ... As for evaluation, the action is selected greedily based on the controller Q_i^φ. All experiments have been repeated for five runs over different random seeds. ... Effect of exploration coefficient α: the paper analyzes the effect of the exploration coefficient α in eq. 4 on performance using FSD-sgl-VDN and FSD in Predator Prey and the MMM2 map of SMAC, respectively. Fig. 8 shows that, in general, a coefficient between 0.01 and 0.1 can effectively improve exploration efficiency. In the relatively simple Predator-Prey scenario, where coordinated exploration is crucial, α can take a larger value, such as 1, to further enhance exploration. However, on the more complex MMM2 map, setting α = 1 leads to excessive exploration by the agents, which ultimately reduces their learning efficiency.
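The exploration coefficient α in the setup above weights an intrinsic bonus against the extrinsic task reward (eq. 4 in the paper). A minimal sketch of this weighting, with the function name and signature assumed for illustration (not the authors' code):

```python
def shaped_reward(r_ext: float, r_int: float, alpha: float = 0.05) -> float:
    """Combine extrinsic and intrinsic rewards; alpha is the exploration
    coefficient from eq. 4. Per the paper's analysis, 0.01-0.1 generally
    works well, while alpha = 1 over-explores on complex maps like MMM2."""
    return r_ext + alpha * r_int


# Sweeping alpha as in the paper's ablation (hypothetical values shown):
for alpha in (0.01, 0.1, 1.0):
    print(alpha, shaped_reward(r_ext=1.0, r_int=0.5, alpha=alpha))
```

The sweep mirrors the reported trade-off: larger α boosts exploration in simple coordinated tasks but can overweight the intrinsic signal elsewhere.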
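The ablation studies also isolate clipped double Q-learning. A generic sketch of the clipped target, taking the minimum of two target networks' greedy values to curb overestimation bias (a TD3-style construction; the function and its signature are illustrative, not the authors' implementation):

```python
import numpy as np


def clipped_double_q_target(r, gamma, done, q1_next, q2_next):
    """Clipped double Q target for a batch of transitions.

    q1_next, q2_next: (batch, n_actions) Q-values from two target networks
    for the next state. The elementwise minimum of their greedy values
    bounds the bootstrap target from above, reducing overestimation.
    """
    v1 = q1_next.max(axis=-1)  # greedy value under target network 1
    v2 = q2_next.max(axis=-1)  # greedy value under target network 2
    return r + gamma * (1.0 - done) * np.minimum(v1, v2)
```

With r = 1, γ = 0.9, and greedy values 3.0 and 2.5 from the two networks, the target is 1 + 0.9 · min(3.0, 2.5) = 3.25.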