Communicating Activations Between Language Model Agents
Authors: Vignav Ramesh, Kenneth Li
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test our method with various functional forms f on two experimental setups (multi-player coordination games and reasoning benchmarks) and find that it achieves up to 27.0% improvement over natural language communication across datasets with <1/4 the compute, illustrating the superiority and robustness of activations as an alternative language for communication between LMs. |
| Researcher Affiliation | Academia | Kempner Institute for AI, Harvard University, Cambridge, MA, USA. Correspondence to: Vignav Ramesh <EMAIL>. |
| Pseudocode | No | The paper describes the method and procedure in prose and uses Figure 1 for illustration, but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide a specific link to source code, nor does it contain an explicit statement about the release of code in supplementary materials or other repositories. |
| Open Datasets | Yes | We validate our method by testing this approach with various functional forms f on two experimental setups: two multiplayer coordination games... and seven reasoning benchmarks spanning multiple domains: Biographies (Du et al., 2023), GSM8k (Cobbe et al., 2021), MMLU High School Psychology, MMLU Formal Logic, MMLU College Biology, MMLU Professional Law, and MMLU Public Relations (Hendrycks et al., 2021). |
| Dataset Splits | Yes | We evaluate on a randomly-sampled size-100 subset of each dataset. ... Indeed, we verify this hypothesis by training W on the GSM8k train set (to produce W_in-dist) and then evaluating with this task-specific linear layer on the GSM8k test set. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer' but does not specify version numbers for any software, libraries, or programming languages used in the implementation. |
| Experiment Setup | Yes | Across all experiment configurations, we fix the decoding strategy to nucleus sampling with p = 0.9. ... In experiments involving the mapping matrix W, we instantiate W ∈ R^(4096×3072) using Xavier initialization and train for 10 epochs on a dataset of 3072 sentences... We use batch size 32 and the Adam optimizer with learning rate 0.001. |
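The quoted setup for the mapping matrix W (a 4096-to-3072 linear map, Xavier initialization, 10 epochs, batch size 32, Adam with learning rate 0.001) can be sketched as below. This is a minimal reconstruction, not the authors' code: the training data here is random placeholder tensors standing in for the paper's 3072 paired sentence activations, and the MSE objective is an assumption, since the paper excerpt does not state the loss.

```python
import torch
import torch.nn as nn

# Hyperparameters quoted in the paper's setup description.
D_IN, D_OUT = 4096, 3072   # W maps sender activations in R^4096 to the receiver's R^3072 space
N_SENTENCES = 3072         # size of the training set quoted in the paper
BATCH, EPOCHS, LR = 32, 10, 1e-3

# Placeholder activation pairs; in the paper these would be LM hidden states
# for the same 3072 sentences under the sender and receiver models.
torch.manual_seed(0)
src = torch.randn(N_SENTENCES, D_IN)
tgt = torch.randn(N_SENTENCES, D_OUT)

# The linear map W, with Xavier initialization as described.
W = nn.Linear(D_IN, D_OUT, bias=False)
nn.init.xavier_uniform_(W.weight)

opt = torch.optim.Adam(W.parameters(), lr=LR)
loss_fn = nn.MSELoss()  # assumed objective; not specified in the excerpt

for epoch in range(EPOCHS):
    perm = torch.randperm(N_SENTENCES)  # reshuffle each epoch
    for i in range(0, N_SENTENCES, BATCH):
        idx = perm[i:i + BATCH]
        opt.zero_grad()
        loss = loss_fn(W(src[idx]), tgt[idx])
        loss.backward()
        opt.step()
```

At inference, the trained W would project a sender-model activation into the receiver model's residual-stream dimensionality before injection.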