Reproducibility Study of "Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation"
Authors: Jose L. Garcia, Karolina Hajkova, Maria Marchenko, Carlos Miguel Patiño
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper presents a reproducibility study and extension of "Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation." We validate the original findings using a range of open-weight models (1.5B-70B parameters), GPT-4, and GPT-4o Mini while introducing several novel contributions. We analyze the Pareto front of the games, propose a communication-free baseline to test whether successful negotiations are possible without agent interaction, evaluate the performance of recent small language models, analyze structural information leakage in model responses, and implement an inequality metric to assess negotiation fairness. |
| Researcher Affiliation | Academia | University of Amsterdam EMAIL |
| Pseudocode | No | The paper includes structured prompt variations in the appendix (Figures 8-13) but these are prompt templates and not pseudocode or algorithm blocks describing a computational procedure. |
| Open Source Code | Yes | The code and instructions for reproducing the results of this work, along with our additions and the fixes mentioned in Section A.9, are available on GitHub: https://github.com/cmpatino/llm_negotiation |
| Open Datasets | Yes | The benchmark's base negotiation game is adapted from HarborCo (Susskind, 1985), a game traditionally used to teach negotiation skills... Our reproducibility goal is to assess whether the benchmark effectively evaluates the models' negotiation performance and whether the Chain-of-Thought (CoT) prompt configurations contribute to successful dealmaking... This paper aims to reproduce the results of an LLM testbed specially designed for negotiation games (Abdelnabi et al., 2024) to contribute to developing robust benchmarks. |
| Dataset Splits | No | The paper describes running "10 experiments per model" (Table 2) and variations of a negotiation game. However, it does not specify traditional machine learning dataset splits (e.g., training, validation, test sets with percentages or counts) as it evaluates pre-trained LLMs on a negotiation benchmark, rather than training models on a dataset. |
| Hardware Specification | Yes | For proprietary models like GPT-4o Mini, we accessed the model via the OpenAI API, which did not require additional computational resources on our end. For the open-source models, we used the Netherlands national supercomputer, Snellius, with access to NVIDIA A100 and H100 GPUs. |
| Software Dependencies | No | The paper mentions using the "Hugging Face pipeline" and "EAR (Corbalan et al., 2020)" for energy monitoring. It also mentions adjusting the "do_sample parameter" and setting "float16 precision" for Llama models. However, it does not provide specific version numbers for these software components or libraries, which would be necessary for a fully reproducible description. |
| Experiment Setup | Yes | Originally, the authors set the model temperature to 0 and a random order of the agents for each round. However, the authors did not add a fixed seed in the code, making their exact results not reproducible. To address this, we added fixed seeds to the code setup. We then ran all the experiments 10 times with seeds ranging from 1 to 10. We also corrected the do_sample parameter in the Hugging Face pipeline, changing it from True to False to set up the correct greedy decoding configuration. |
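The Experiment Setup row describes two determinism fixes: flipping `do_sample` from `True` to `False` (greedy decoding) and running with fixed seeds 1-10. A minimal self-contained sketch in plain Python (not the benchmark's actual Hugging Face code) of why each change yields reproducible outputs:

```python
import random

def decode_step(probs, do_sample, rng):
    """One toy decoding step over a token distribution.

    do_sample=True draws from the distribution (needs a seeded RNG to be
    reproducible); do_sample=False takes the argmax, i.e. greedy decoding,
    which is deterministic regardless of any seed.
    """
    if do_sample:
        return rng.choices(range(len(probs)), weights=probs)[0]
    return max(range(len(probs)), key=probs.__getitem__)

probs = [0.1, 0.6, 0.3]

# Greedy decoding: identical output on every run, no seed required.
greedy_runs = [decode_step(probs, do_sample=False, rng=random.Random())
               for _ in range(5)]
assert all(token == 1 for token in greedy_runs)

# Sampling: only reproducible once the seed is fixed, mirroring the
# paper's 10 runs with seeds 1 through 10.
for seed in range(1, 11):
    rng_a, rng_b = random.Random(seed), random.Random(seed)
    run_a = [decode_step(probs, True, rng_a) for _ in range(5)]
    run_b = [decode_step(probs, True, rng_b) for _ in range(5)]
    assert run_a == run_b
```

This is only an illustration of the decoding semantics; the reproduced experiments set these options through the Hugging Face pipeline rather than a hand-rolled sampler.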