Referential communication in heterogeneous communities of pre-trained visual deep networks

Authors: Matéo Mahaut, Roberto Dessì, Francesca Franzon, Marco Baroni

TMLR 2025

Reproducibility assessment — each row gives the variable, the result, and the supporting LLM response (quoted from the paper):
Research Type: Experimental — "After reviewing related work (Section 2) and presenting our general setup (Section 3), we delve into our experiments in Section 4. First, in Section 4.1, we show that it is indeed possible for sets of heterogeneous pre-trained networks to successfully converge on a referent through an induced communication protocol. In Section 4.2, we study referential generalization, showing that the developed protocol is sufficiently flexible that the networks can use it to refer to objects that were not seen during the training phase... Tables 3 and 4 show that communication is at least partially successful at a more granular level than ImageNet1k classes... Table 5 shows that 64-dimensional communication is still possible in this zero-shot dataset-transfer experiment..."
Researcher Affiliation: Academia — "Matéo Mahaut (EMAIL), Francesca Franzon (EMAIL), Roberto Dessì (EMAIL), Universitat Pompeu Fabra; Marco Baroni (EMAIL), Universitat Pompeu Fabra and ICREA"
Pseudocode: Yes — "Pseudocode for the referential game is presented in Appendix A." (Appendix A: "The one-to-one referential game in pseudocode")
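The one-to-one referential game can be sketched roughly as below. This is an illustrative stand-in, not the paper's EGG implementation: the dimensions, module names, and the random linear "encoders" (standing in for frozen pre-trained vision modules) are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

FEAT_DIM, MSG_DIM, BATCH = 512, 16, 8  # hypothetical sizes for the sketch

# Stand-ins for frozen pre-trained vision modules (random projections here).
sender_encoder = torch.nn.Linear(FEAT_DIM, FEAT_DIM).requires_grad_(False)
receiver_encoder = torch.nn.Linear(FEAT_DIM, FEAT_DIM).requires_grad_(False)

# Trainable wrappers: the sender emits a continuous message from the target's
# features; the receiver maps the message into the space of its own features.
sender_head = torch.nn.Sequential(torch.nn.Linear(FEAT_DIM, MSG_DIM),
                                  torch.nn.Sigmoid())
receiver_head = torch.nn.Linear(MSG_DIM, FEAT_DIM)

def play_round(images, temperature=0.1):
    """One batch of games: each image is the target once; the receiver must
    single it out among all batch images via cosine similarity."""
    message = sender_head(sender_encoder(images))      # (B, MSG_DIM)
    query = receiver_head(message)                     # (B, FEAT_DIM)
    candidates = receiver_encoder(images)              # (B, FEAT_DIM)
    sims = F.cosine_similarity(query.unsqueeze(1),
                               candidates.unsqueeze(0), dim=-1)
    return sims / temperature                          # (B, B) game logits

images = torch.randn(BATCH, FEAT_DIM)                  # stand-in image features
logits = play_round(images)
loss = F.cross_entropy(logits, torch.arange(BATCH))    # target i is candidate i
```

Training would backpropagate this cross-entropy loss through the sender and receiver heads only, since the vision modules stay frozen.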
Open Source Code: Yes — "https://github.com/facebookresearch/EGG, scripts from our experiments are in ./egg/zoo/pop"
Open Datasets: Yes — "As nearly all agents rely on vision modules pre-trained on the ILSVRC2012 training set, we sample images from the validation data of that same dataset (50,000 images)... Thus, our ImageNet1k communication training and testing sets are both extracted from the original ILSVRC2012 validation set. Agents are also tested on an out-of-domain (OOD) dataset containing classes from the larger ImageNet21k repository, as pre-processed by Ridnik et al. (2021)... We further test the agents that were communication-trained on ImageNet1k on 3 different datasets... 1) Cifar100 (Krizhevsky et al., 2009)... 2) Places205 (Zhou et al., 2014)... 3) CelebA (Liu et al., 2015)... We provide scripts to reproduce our ImageNet1k and OOD datasets at https://github.com/mahautm/emecom_pop_data"
Dataset Splits: Yes — "we sample images from the validation data of that same dataset (50,000 images) to teach them to play the referential game, while reserving 10% of those images for testing (note that we do not use image annotations). Thus, our ImageNet1k communication training and testing sets are both extracted from the original ILSVRC2012 validation set. We used 90% of the OOD data for testing OOD communication accuracy (Section 4.2 below) and to train the classifiers in the experiments reported in Appendix H. The remaining 10% was used to test the latter classifiers. Batch size is set at 64, the largest value we could robustly fit in GPU memory. As we sample distractors directly from training batches, on each training step the referential game is played 64 times, once with every different image in the batch as target and the other 63 images serving as distractors."
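The split arithmetic above can be made concrete in a few lines; the seed and the shuffle-based sampling are assumptions for illustration, not the paper's exact procedure.

```python
import random

random.seed(0)  # hypothetical seed; the paper does not report one

N_VAL = 50_000                     # ILSVRC2012 validation set size
indices = list(range(N_VAL))
random.shuffle(indices)

n_test = N_VAL // 10               # 10% held out for communication testing
test_idx, train_idx = indices[:n_test], indices[n_test:]

BATCH = 64
# Each training step draws a batch of 64 images; every image serves as the
# target once while the other 63 act as distractors, so 64 games per step.
games_per_step = BATCH
steps_per_epoch = len(train_idx) // BATCH
```

With 45,000 training images and batch size 64, one pass over the data amounts to 703 full batches, i.e. roughly 45,000 games per epoch.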
Hardware Specification: Yes — "Each experiment was conducted using a single NVIDIA A30 GPU."
Software Dependencies: No — "All experiments are implemented using the EGG toolkit (Kharitonov et al., 2019). Our version of CLIP uses the ViT architecture for image encoding (we use the ViT-based CLIP model from the 'PyTorch Image Models' library (Wightman, 2019)..."
Experiment Setup: Yes — "Batch size is set at 64, the largest value we could robustly fit in GPU memory... All parameters needed to reproduce our experiments with the toolkit (both the continuous setup reported here, and the discrete one discussed in Appendix C) can be found in Appendix E." Table 13 (hyperparameters for training continuous communication channels):
- batch size: 64
- optimizer: Adam
- learning rate: 1e-4
- max message length: 1
- non-linearity: sigmoid
- vocab size: 16 / 64
- receiver hidden dimension: 2048
- image size: 384
- receiver cosine temperature: 0.1
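For convenience, the Table 13 hyperparameters can be collected into a plain configuration dictionary; the key names below are ours, not EGG's command-line flags.

```python
# Hyperparameters from Table 13 (continuous communication channel).
config = {
    "batch_size": 64,
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "max_message_length": 1,
    "non_linearity": "sigmoid",
    "vocab_size": (16, 64),            # two channel widths are reported
    "receiver_hidden_dim": 2048,
    "image_size": 384,
    "receiver_cosine_temperature": 0.1,
}
```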