Cowpox: Towards the Immunity of VLM-based Multi-Agent Systems

Authors: Yutong Wu, Jie Zhang, Yiming Li, Chao Zhang, Qing Guo, Han Qiu, Nils Lukas, Tianwei Zhang

ICML 2025

Reproducibility assessment (variable, result, and supporting LLM response)

Research Type: Experimental
    "We demonstrate the effectiveness of COWPOX empirically and provide theoretical robustness guarantees. The code can be found via https://github.com/WU-YU-TONG/Cowpox. ... We conduct extensive experiments to verify the effectiveness of our COWPOX mechanism and its resistance to potential adaptive attacks."

Researcher Affiliation: Academia
    "1 College of Computing and Data Science, Nanyang Technological University, Singapore, Singapore. 2 CFAR and IHPC, Agency for Science, Technology and Research, Singapore. 3 Network and Information Security Lab, Tsinghua University, Beijing, China. 4 Mohamed bin Zayed University of Artificial Intelligence, Masdar City, Abu Dhabi. Correspondence to: Jie Zhang <EMAIL>."

Pseudocode: Yes
    "Algorithm 1 COWPOX (in one chat round)"

Open Source Code: Yes
    "The code can be found via https://github.com/WU-YU-TONG/Cowpox."

Open Datasets: Yes
    "We test the performance of each agent in the system by prompting every agent with a request sampled from a subset of LLaVA-Bench (Liu et al., 2024a) and use GPT-4o to score their outputs. ... The evaluation is conducted on a combination of malicious outputs from AdvBench and normal (benign) outputs from the ordinary chat history of our agents. ... We randomly select 200 samples from the full album of all the agents (Gu et al., 2024) and generate the virus samples based on them."

Dataset Splits: No
    The paper specifies parameters for the multi-agent system, such as history length |H| = 3 and album size |B| = 10, and states that experiments last for 64 chat rounds. It mentions using "a subset of LLaVA-Bench", evaluating on "malicious outputs from AdvBench and normal (benign) outputs from the ordinary chat history", and generating 200 virus samples, but it does not provide explicit training/validation/test splits for any dataset.

Hardware Specification: No
    "To simplify the implementation and due to the limitation in computational resources, all of the agents in the system query the same model during the experiment." The paper does not provide specific hardware details such as GPU/CPU models or memory amounts.

Software Dependencies: No
    "We mainly exploit the LLaVA-1.5 7B (Liu et al., 2024a) as the base model of the multi-agent system and utilize CLIP (Radford et al., 2021) to construct the RAG module. ... We use GPT-4o to score their outputs." The paper names the specific models and APIs used (LLaVA-1.5 7B, CLIP, GPT-4o) but does not list versions for programming languages, libraries, or other software dependencies.

Experiment Setup: Yes
    "Base VLM Model. We mainly exploit the LLaVA-1.5 7B (Liu et al., 2024a) as the base model of the multi-agent system and utilize CLIP (Radford et al., 2021) to construct the RAG module. ... Multi-Agent System. ... The history length |H| for each agent is set to 3, and the album size is kept as 10 if it is not exclusively mentioned. All the experiments last for 64 chat rounds. ... We vary the number of COWPOX agents κ from 0 to 16. We keep N = 128, |H| = 3, |B| = 10 in these experiments. All the chats last 64 epochs. ... We conducted the experiments on the system with 128 high-diversity agents."
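The experiment-setup entry fixes the system hyperparameters reported in the paper (N = 128 agents, history length |H| = 3, album size |B| = 10, 64 chat rounds, and the number of COWPOX agents κ varied from 0 to 16). A minimal sketch of how such a configuration sweep might be expressed is below; all field names and the exact κ grid are illustrative assumptions, not the interface of the released code at the linked repository.

```python
from dataclasses import dataclass


@dataclass
class CowpoxConfig:
    """Hyperparameters reported in the paper; field names are hypothetical."""
    num_agents: int = 128        # N: agents in the multi-agent system
    history_len: int = 3         # |H|: chat-history length per agent
    album_size: int = 10         # |B|: RAG album size per agent
    chat_rounds: int = 64        # rounds per experiment
    num_cowpox_agents: int = 0   # kappa: varied from 0 to 16 in the ablation


def sweep_configs():
    """Yield one config per kappa value; the grid here is an assumed example."""
    for kappa in (0, 1, 2, 4, 8, 16):
        yield CowpoxConfig(num_cowpox_agents=kappa)


configs = list(sweep_configs())
```

Keeping the fixed parameters as dataclass defaults makes the κ ablation the only varying axis, mirroring the paper's statement that N, |H|, and |B| are held constant in that experiment.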
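The software-dependencies entry notes that CLIP embeddings underpin the RAG module. As a toy illustration only (the function names, shapes, and scoring are assumptions; the paper's module uses real CLIP embeddings, not hand-written vectors), retrieval from an agent's album by cosine similarity can be sketched as:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve(query_emb, album, top_k=1):
    """Return the ids of the top_k album entries most similar to the query.

    album: list of (entry_id, embedding) pairs. In the paper's setup the
    embeddings would come from CLIP; any fixed-length vectors work here.
    """
    ranked = sorted(album, key=lambda e: cosine(query_emb, e[1]), reverse=True)
    return [entry_id for entry_id, _ in ranked[:top_k]]


# Toy 3-dimensional "embeddings" standing in for CLIP features.
album = [
    ("img_a", [1.0, 0.0, 0.0]),
    ("img_b", [0.0, 1.0, 0.0]),
    ("img_c", [0.9, 0.1, 0.0]),
]
print(retrieve([1.0, 0.1, 0.0], album, top_k=2))  # → ['img_c', 'img_a']
```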