Reproducibility Study of "Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation"
Authors: Jose L. Garcia, Karolina Hajkova, Maria Marchenko, Carlos Miguel Patiño
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper presents a reproducibility study and extension of "Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation." We validate the original findings using a range of open-weight models (1.5B-70B parameters), GPT-4, and GPT-4o Mini while introducing several novel contributions. We analyze the Pareto front of the games, propose a communication-free baseline to test whether successful negotiations are possible without agent interaction, evaluate the performance of recent small language models, analyze structural information leakage in model responses, and implement an inequality metric to assess negotiation fairness. |
| Researcher Affiliation | Academia | University of Amsterdam EMAIL |
| Pseudocode | No | The paper includes structured prompt variations in the appendix (Figures 8-13) but these are prompt templates and not pseudocode or algorithm blocks describing a computational procedure. |
| Open Source Code | Yes | The code and instructions for reproducing the results of this work, along with our additions and the fixes mentioned in Section A.9, are available on GitHub: https://github.com/cmpatino/llm_negotiation |
| Open Datasets | Yes | The benchmark's base negotiation game is adapted from HarborCo (Susskind, 1985), a game traditionally used to teach negotiation skills... Our reproducibility goal is to assess whether the benchmark effectively evaluates the models' negotiation performance and whether the Chain-of-Thought (CoT) prompt configurations contribute to successful dealmaking... This paper aims to reproduce the results of an LLM testbed specially designed for negotiation games (Abdelnabi et al., 2024) to contribute to developing robust benchmarks. |
| Dataset Splits | No | The paper describes running "10 experiments per model" (Table 2) and variations of a negotiation game. However, it does not specify traditional machine learning dataset splits (e.g., training, validation, test sets with percentages or counts) as it evaluates pre-trained LLMs on a negotiation benchmark, rather than training models on a dataset. |
| Hardware Specification | Yes | For proprietary models like GPT-4o Mini, we accessed the model via the OpenAI API, which did not require additional computational resources on our end. For the open-source models, we used the Netherlands national supercomputer, Snellius, with access to NVIDIA A100 and H100 GPUs. |
| Software Dependencies | No | The paper mentions using the "Hugging Face pipeline" and "EAR (Corbalan et al., 2020)" for energy monitoring. It also mentions adjusting the "do_sample parameter" and setting "float16 precision" for Llama models. However, it does not provide specific version numbers for these software components or libraries, which would be necessary for a fully reproducible description. |
| Experiment Setup | Yes | Originally, the authors set the model temperature to 0 and a random order of the agents for each round. However, the authors did not add a fixed seed in the code, making their exact results not reproducible. To address this, we added fixed seeds to the code setup. We then ran all the experiments 10 times with seeds ranging from 1 to 10. We also corrected the do_sample parameter in the Hugging Face pipeline, changing it from True to False to set up the correct greedy decoding configuration. |
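The Experiment Setup row describes two determinism fixes: flipping `do_sample` from `True` to `False` (greedy decoding) and running with fixed seeds 1-10. A minimal self-contained sketch in plain Python (not the benchmark's actual Hugging Face code) of why each change yields reproducible outputs:

```python
import random

def decode_step(probs, do_sample, rng):
    """One toy decoding step over a token distribution.

    do_sample=True draws from the distribution (needs a seeded RNG to be
    reproducible); do_sample=False takes the argmax, i.e. greedy decoding,
    which is deterministic regardless of any seed.
    """
    if do_sample:
        return rng.choices(range(len(probs)), weights=probs)[0]
    return max(range(len(probs)), key=probs.__getitem__)

probs = [0.1, 0.6, 0.3]

# Greedy decoding: identical output on every run, no seed required.
greedy_runs = [decode_step(probs, do_sample=False, rng=random.Random())
               for _ in range(5)]
assert all(token == 1 for token in greedy_runs)

# Sampling: only reproducible once the seed is fixed, mirroring the
# paper's 10 runs with seeds 1 through 10.
for seed in range(1, 11):
    rng_a, rng_b = random.Random(seed), random.Random(seed)
    run_a = [decode_step(probs, True, rng_a) for _ in range(5)]
    run_b = [decode_step(probs, True, rng_b) for _ in range(5)]
    assert run_a == run_b
```

This is only an illustration of the decoding semantics; the reproduced experiments set these options through the Hugging Face pipeline rather than a hand-rolled sampler.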