Collaboration with Dynamic Open Ad Hoc Team via Team State Modelling
Authors: Jing Sun, Cong Zhang, Zhiguang Cao
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across four challenging multi-agent benchmark tasks (Level-Based Foraging, Wolf-Pack, Cooperative Navigation, and Fort Attack) demonstrate that our method successfully enables dynamic teamwork in open ad hoc settings. Open-OTAF outperforms state-of-the-art methods, achieving superior performance with faster convergence. These results validate Open-OTAF's capacity to adapt to unknown teammates while maintaining computational efficiency. |
| Researcher Affiliation | Academia | Jing Sun (Faculty of Data Science, City University of Macau); Cong Zhang (College of Computing and Data Science, Nanyang Technological University); Zhiguang Cao (School of Computing and Information Systems, Singapore Management University) |
| Pseudocode | Yes | We summarize our training procedure in Algorithm 1 in the Appendix A.1. ... Algorithm 1 Training Procedure |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository; the OpenReview link points only to the peer-review process. |
| Open Datasets | Yes | Extensive experiments across four challenging multi-agent benchmark tasks (Level-Based Foraging, Wolf-Pack, Cooperative Navigation, and Fort Attack) demonstrate that our method successfully enables dynamic teamwork in open ad hoc environments. ... We select four multi-agent tasks as our environments, as shown in Figure A.2 in Appendix A.2.1. Among them, Level-Based Foraging (LBF), Wolfpack, and Fortattack are three scenarios from (Rahman et al., 2021b). LBF is a cooperative grid-world game in which agents are rewarded if they concurrently navigate to the food and collect it. In Wolfpack, multiple agents (predators) need to chase and encounter the adversary agent (prey) to win the game. The Fort Attack environment defines a bounded two-dimensional space where agents are constrained within specific coordinate ranges. We also evaluate our method on penalized cooperative navigation from the MPE environment (Lowe et al., 2017), where multiple agents are trained to move towards landmarks while avoiding collisions with each other. |
| Dataset Splits | Yes | For each environment, we design 20 different policies: 10 policies serve as the training set and the other 10 as the testing set. Specifically, in the Wolfpack environment, we used the 9 heuristic policies proposed by (Barrett et al., 2011) along with an A2C algorithm (Mnih et al., 2015) to control teammates. ... Open-OTAF was initially trained on 5 teammates with 10 different training policies and then tested on 3, 5, 6, 8 and 10 teammates with 10 different testing teammate policies. |
| Hardware Specification | Yes | All experiments are carried out on a machine with an Intel Core i9-10940X CPU and a single Nvidia GeForce 2080Ti GPU. |
| Software Dependencies | No | The paper mentions software components like 'Adam optimizer' and algorithms like 'A2C algorithm' and 'DQN', but does not specify version numbers for any programming languages, libraries, or frameworks used (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We report the performance in terms of average returns (solid line) and the standard deviation (shaded areas) over 7 random seeds, as summarized in Figure 5 and Table 1. ... The parameters of the networks are updated by the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 1e-4 for all environments. The discounting factor is γ = 0.99, α = 1, and the concentration hyperparameter is λ = 1, since these values induce the best performance compared with other values. For exploration, we use ε-greedy annealed from 1.0 to 0.05. Batches of 128 episodes are sampled from the replay buffer, and all components in the framework are trained together in an end-to-end fashion. More details of the hyperparameters are provided in Table A.3. ... Table A.3: Hyperparameters in the experiments (T = 640,0000; α = 1; γ = 0.99; optimizer = Adam; batch size = 128; learning rate = 1e-4; evaluation interval = 3200; action selector = ε-greedy; ε start = 1; ε finish = 0.05; replay memory size = 5000) |
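For concreteness, the reported settings can be sketched as a config dict together with an ε-greedy annealing schedule. Note this is a minimal sketch: the paper gives only the start (1.0) and finish (0.05) values, so the linear shape of the schedule and the `anneal_steps` horizon below are assumptions, and the ambiguously printed total-step count "640,0000" is deliberately omitted.

```python
# Hedged sketch of the hyperparameters reported in Table A.3.
# The linear annealing shape and the anneal horizon are assumptions,
# not taken from the paper.

CONFIG = {
    "gamma": 0.99,          # discounting factor
    "alpha": 1.0,           # α as listed in Table A.3
    "lambda_conc": 1.0,     # concentration hyperparameter λ
    "lr": 1e-4,             # Adam learning rate
    "batch_size": 128,      # episodes sampled per batch
    "eps_start": 1.0,       # ε-greedy start value
    "eps_finish": 0.05,     # ε-greedy finish value
    "replay_size": 5000,    # replay memory size
    "eval_interval": 3200,  # evaluation interval
}

def epsilon(step: int, cfg: dict = CONFIG, anneal_steps: int = 100_000) -> float:
    """Linearly anneal ε from eps_start to eps_finish (schedule assumed)."""
    frac = min(step / anneal_steps, 1.0)
    return cfg["eps_start"] + frac * (cfg["eps_finish"] - cfg["eps_start"])
```

At `step = 0` this returns 1.0, and after `anneal_steps` it stays clamped at 0.05, matching the reported ε range.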