Reproducibility Study of "Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation"

Authors: Jose L. Garcia, Karolina Hajkova, Maria Marchenko, Carlos Miguel Patiño

TMLR 2025

Reproducibility assessment (variable, result, and LLM justification):
Research Type: Experimental. This paper presents a reproducibility study and extension of "Cooperation, Competition, and Maliciousness: LLM-Stakeholders Interactive Negotiation." We validate the original findings using a range of open-weight models (1.5B-70B parameters), GPT-4, and GPT-4o Mini, while introducing several novel contributions: we analyze the Pareto front of the games, propose a communication-free baseline to test whether successful negotiations are possible without agent interaction, evaluate the performance of recent small language models, analyze structural information leakage in model responses, and implement an inequality metric to assess negotiation fairness.
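The inequality metric itself is not specified in this summary; purely as an illustration, a Gini coefficient over the agents' final payoffs is one standard way to quantify negotiation fairness. The function below is a sketch under that assumption, not the paper's implementation:

```python
def gini(payoffs):
    """Gini coefficient of a list of non-negative payoffs.

    Returns 0.0 for perfectly equal payoffs and values approaching 1.0
    as the total payoff concentrates in a single agent.
    """
    values = sorted(payoffs)
    n = len(values)
    total = sum(values)
    if total == 0:
        return 0.0
    # Rank-weighted form of the Gini index: G = 2*sum(i*x_i)/(n*sum(x)) - (n+1)/n
    weighted = sum((i + 1) * v for i, v in enumerate(values))
    return (2 * weighted) / (n * total) - (n + 1) / n

# Equal split -> 0; everything to one of four agents -> 0.75.
assert gini([50, 50, 50, 50]) == 0.0
assert abs(gini([0, 0, 0, 100]) - 0.75) < 1e-9
```

Such a metric could be computed per game over the stakeholders' final scores to compare fairness across models and prompt configurations.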
Researcher Affiliation: Academia. University of Amsterdam.
Pseudocode: No. The paper includes structured prompt variations in the appendix (Figures 8-13), but these are prompt templates rather than pseudocode or algorithm blocks describing a computational procedure.
Open Source Code: Yes. The code and instructions for reproducing the results of this work, along with our additions and the fixes mentioned in Section A.9, are available on GitHub: https://github.com/cmpatino/llm_negotiation
Open Datasets: Yes. The benchmark's base negotiation game is adapted from Harborco (Susskind, 1985), a game traditionally used to teach negotiation skills... Our reproducibility goal is to assess whether the benchmark effectively evaluates the models' negotiation performance and whether the Chain-of-Thought (CoT) prompt configurations contribute to successful dealmaking... This paper aims to reproduce the results of an LLM testbed specially designed for negotiation games (Abdelnabi et al., 2024) to contribute to developing robust benchmarks.
Dataset Splits: No. The paper describes running "10 experiments per model" (Table 2) and variations of a negotiation game. However, it does not specify traditional machine learning dataset splits (e.g., training, validation, and test sets with percentages or counts), since it evaluates pre-trained LLMs on a negotiation benchmark rather than training models on a dataset.
Hardware Specification: Yes. For proprietary models like GPT-4o Mini, we accessed the model via the OpenAI API, which did not require additional computational resources on our end. For the open-source models, we used the Netherlands national supercomputer, Snellius, with access to NVIDIA A100 and H100 GPUs.
Software Dependencies: No. The paper mentions using the "Hugging Face pipeline" and "EAR (Corbalan et al., 2020)" for energy monitoring. It also mentions adjusting the "do_sample" parameter and setting "float16" precision for the Llama models. However, it does not provide specific version numbers for these software components or libraries, which are necessary for a reproducible description.
Experiment Setup: Yes. Originally, the authors set the model temperature to 0 and used a random agent order for each round. However, they did not fix a random seed in the code, making their exact results irreproducible. To address this, we added fixed seeds to the code setup and ran all experiments 10 times with seeds ranging from 1 to 10. We also corrected the do_sample parameter in the Hugging Face pipeline, changing it from True to False to configure greedy decoding correctly.
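The seeding fix can be sketched as follows. The function and agent names are illustrative assumptions, not the repository's actual code; the sketch assumes that, once greedy decoding (do_sample=False) is enabled, the per-round agent order is the remaining source of randomness:

```python
import random

def seeded_round_orders(agents, seed, num_rounds=3):
    """Produce the per-round agent orders for one seeded experiment.

    With a fixed seed, the shuffled order of agents in every round is
    fully reproducible, so repeated runs with the same seed replay the
    same negotiation schedule.
    """
    rng = random.Random(seed)  # dedicated RNG so the seed governs all shuffles
    orders = []
    for _ in range(num_rounds):
        order = agents[:]
        rng.shuffle(order)  # random agent order per round, but seeded
        orders.append(order)
    return orders

# Hypothetical stakeholder labels; seeds 1..10 as in the reproduction.
agents = ["project", "green", "union", "local_gov", "other_ports"]
assert seeded_round_orders(agents, seed=1) == seeded_round_orders(agents, seed=1)
```

Running the loop once per seed in range(1, 11) then yields the 10 reproducible experiments per model described above.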