Learning to Communicate Through Implicit Communication Channels

Authors: Han Wang, Binbin Chen, Tieying Zhang, Baoxiang Wang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We validate the effectiveness of ICP through comprehensive experiments on the tasks of Guessing Numbers, Revealing Goals, and Hanabi (Bard et al., 2020)."
Researcher Affiliation | Collaboration | Han Wang (The Chinese University of Hong Kong, Shenzhen), Binbin Chen (ByteDance Inc.), Tieying Zhang (ByteDance Inc.), Baoxiang Wang (The Chinese University of Hong Kong, Shenzhen; Vector Institute)
Pseudocode | Yes | Algorithm 1: ICP implementation with DIAL and VDN
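Algorithm 1 builds on the VDN value decomposition (Sunehag et al., 2018), in which the joint action-value is the sum of per-agent action-values. A minimal sketch of that decomposition is below; the function name and data layout are illustrative, not taken from the paper.

```python
def vdn_joint_q(per_agent_qs, joint_action):
    """VDN mixing: Q_tot(a_1, ..., a_n) = sum_i Q_i(a_i).

    per_agent_qs: one list of action-values per agent.
    joint_action: the action index each agent selected.
    """
    return sum(qs[a] for qs, a in zip(per_agent_qs, joint_action))

# Two agents, two actions each: Q_tot = Q_1(a_1) + Q_2(a_2) = 0.8 + 0.5
q_tot = vdn_joint_q([[0.2, 0.8], [0.5, 0.1]], [1, 0])
```

Because the mixing is a plain sum, the joint TD error can be backpropagated into each agent's individual Q-network, which is what makes the decentralized-execution setup tractable.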
Open Source Code | Yes | "The code and our designed environments are freely available in the supplementary material."
Open Datasets | Yes | "We validate the effectiveness of ICP through comprehensive experiments on the tasks of Guessing Numbers, Revealing Goals, and Hanabi (Bard et al., 2020). These environments share a common characteristic: they lack direct communication channels, yet agents must collaboratively make decisions to achieve shared rewards. This setting introduces significant challenges, including sparse and delayed reward feedback and difficulty in credit assignment, both temporally and among agents. Despite these hurdles, our experiments on Guessing Numbers and Revealing Goals demonstrate that ICP significantly enhances performance over baseline methods through more efficient information transmission. In Hanabi, a popular card game played by humans, our approach achieved an average score of 24.91 out of 25, surpassing the best available learning algorithm, which obtains 23.81."
Dataset Splits | No | "In the Guessing Numbers experiment, we evaluate the performance of five approaches: VDN-on-policy, VDN-off-policy, ICP with the random initial map approach (ICP-DIAL-RM), ICP with the delayed map approach (ICP-DIAL-DM), and a cheating approach where a direct communication channel is available (DIAL-Cheat). Each approach is evaluated over 1k episodes with 6 random seeds, run on a Linux bare-metal machine with 256 GB RAM and a 3090 Ti GPU for 36 hours. For VDN-off-policy, we begin by warming up the replay buffer until its size exceeds the batch size. During each training step, we add 10 episodes to the replay buffer and randomly sample a batch of episodes from the buffer for training. In contrast, for VDN-on-policy and our proposed method, we use a vectorized environment to sample a batch of episodes at each training step and train on these samples."
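The off-policy procedure quoted above (warm up the replay buffer past the batch size, then add 10 episodes per step and train on a random batch) can be sketched as follows. All names here are illustrative; the paper's actual implementation is in its supplementary material.

```python
import random

def train_off_policy(sample_episode, train_step, batch_size, total_steps,
                     episodes_per_step=10):
    """Sketch of the VDN-off-policy loop described above.

    sample_episode: callable returning one rollout episode.
    train_step: callable consuming a batch (a list) of episodes.
    """
    buffer = []
    # Warm-up: fill the replay buffer until it can supply one batch.
    while len(buffer) < batch_size:
        buffer.append(sample_episode())
    for _ in range(total_steps):
        # Add fresh episodes, then train on a uniformly sampled batch.
        buffer.extend(sample_episode() for _ in range(episodes_per_step))
        batch = random.sample(buffer, batch_size)
        train_step(batch)
    return buffer
```

The on-policy variants differ only in that each training step samples a full batch of fresh episodes from a vectorized environment instead of drawing from a buffer.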
Hardware Specification | Yes | "Each approach is evaluated over 1k episodes with 6 random seeds, run on a Linux bare-metal machine with 256 GB RAM and a 3090 Ti GPU for 36 hours."
Software Dependencies | No | No specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow) are provided in the paper.
Experiment Setup | Yes | "Specifically, we set the hidden size of the MLP and GRU to 256, use 2 layers in the GRU, and set the learning rate to 5 × 10⁻⁴ and the batch size to 256. The target network update rate is set to 10, γ is set to 0.99, ϵ is set to 0.1, and we apply gradient clipping with a threshold of 10."
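For reference, the reported hyperparameters collected into one place. The key names below are my own shorthand; the paper does not specify a configuration format.

```python
# Hyperparameters as reported in the experiment setup (key names are
# illustrative, not from the paper's code).
config = {
    "hidden_size": 256,            # MLP and GRU hidden size
    "gru_layers": 2,               # number of GRU layers
    "learning_rate": 5e-4,         # 5 × 10⁻⁴
    "batch_size": 256,
    "target_update_interval": 10,  # target network update rate
    "gamma": 0.99,                 # discount factor γ
    "epsilon": 0.1,                # exploration rate ϵ
    "grad_clip": 10.0,             # gradient clipping threshold
}
```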