On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents

Authors: Jen-Tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixi Chen, Wenxuan Wang, Youliang Yuan, Michael Lyu, Maarten Sap

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on four downstream tasks using six systems show that the hierarchical structure, i.e., A (B C), exhibits superior resilience, with the lowest performance drop of 5.5%, compared to 10.5% and 23.7% for the other two structures.
Researcher Affiliation | Academia | 1 Chinese University of Hong Kong; 2 Tsinghua University; 3 Carnegie Mellon University; 4 Peking University; 5 Renmin University of China; 6 Chinese University of Hong Kong, Shenzhen.
Pseudocode | No | The paper describes methodologies (AUTOTRANSFORM and AUTOINJECT) and provides illustrative examples in Figure 2, but it does not include a clearly labeled pseudocode block or algorithm section for these methods or any other part of the research.
Open Source Code | Yes | Our code and data are available at https://github.com/CUHK-ARISE/MAS-Resilience.
Open Datasets | Yes | Code Generation: HumanEval (Chen et al., 2021) contains 164 hand-written programming problems to assess LLMs' ability to synthesize correct and functional Python code; accuracy (Pass@1) is used for evaluation. Math Problem Solving: CIAR (Liang et al., 2024) presents 50 questions with hidden traps to evaluate LLMs' Counter-Intuitive Arithmetic Reasoning abilities, requiring multi-step reasoning. Translation: CommonMT (He et al., 2020) consists of paired sentences to test models' handling of three types of commonsense reasoning, especially in ambiguous contexts. Text Evaluation: FairEval (Wang et al., 2024a) includes 80 human-annotated win/tie/lose labels comparing responses from ChatGPT and Vicuna-13B, aiming to determine whether the model's preferences align with human judgments.
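The Pass@1 metric used for HumanEval can be computed with the unbiased pass@k estimator introduced by Chen et al. (2021); a minimal sketch (the function name is illustrative, not from the paper's codebase):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: total samples generated per problem
    c: samples that pass all unit tests
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k=1 this reduces to the fraction of passing samples:
print(pass_at_k(10, 3, 1))  # 1 - C(7,1)/C(10,1) = 0.3
```

For k=1 (as in this paper) the estimator is simply c/n, but the general form avoids bias when estimating pass@k from n > k samples.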
Dataset Splits | No | The paper mentions specific datasets and their sizes (e.g., HumanEval contains 164 problems, CIAR presents 50 questions, CommonMT randomly sampled 100 sentences, FairEval includes 80 labels) and evaluates performance metrics on them. However, it does not explicitly describe how these datasets are split into training, validation, and test sets, nor does it reference standard predefined splits for reproducibility beyond naming the datasets.
Hardware Specification | No | The paper specifies the use of 'GPT-3.5' and 'GPT-4o' as the backbone LLMs for experiments. While these are large language models, the paper does not provide specific details about the underlying hardware (e.g., GPU models, CPU types, memory) on which these models were run or fine-tuned for the experiments.
Software Dependencies | No | The paper states that 'GPT-3.5' and 'GPT-4o' are used as backbone LLMs and that 'All LLMs are used with a temperature of zero.' This identifies the core language models and one hyperparameter. However, it does not list other ancillary software dependencies, such as specific programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or their respective version numbers, which are typically needed for full reproducibility.
Experiment Setup | Yes | GPT-3.5 is consistently used for both AUTOTRANSFORM and AUTOINJECT to ensure a fair comparison. We use GPT-3.5 and GPT-4o as the backbone for these systems for the main experiments (RQ1 and RQ2) while using GPT-3.5 for factor analysis. All LLMs are used with a temperature of zero. We introduce one faulty agent at a time to avoid interference and facilitate essential analysis, as shown in Table 1.
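The single-faulty-agent setup can be sketched as follows. This is a hypothetical illustration, not the paper's AUTOINJECT implementation: the agent callables, `run_round`, and the `inject_error` placeholder are all assumptions for exposition.

```python
def inject_error(message: str) -> str:
    # Placeholder for AUTOINJECT-style corruption (hypothetical);
    # the real method rewrites an agent's output to introduce errors.
    return message + " [corrupted]"

def run_round(agents, task, faulty_idx=None):
    """One collaboration round in which at most one agent is faulty,
    mirroring the paper's one-faulty-agent-at-a-time setup."""
    messages = []
    for i, agent in enumerate(agents):
        msg = agent(task)  # agent: any callable returning a text message
        if i == faulty_idx:
            msg = inject_error(msg)
        messages.append(msg)
    return messages

# Usage with trivial stand-in agents; only agent 1 is corrupted:
agents = [lambda t: f"solution to {t}" for _ in range(3)]
print(run_round(agents, "task", faulty_idx=1))
```

Injecting exactly one faulty agent per run keeps the fault isolated, so any performance drop can be attributed to that single agent rather than to interacting failures.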