Navigating Social Dilemmas with LLM-based Agents via Consideration of Future Consequences

Authors: Dung Nguyen, Hung Le, Kien Do, Sunil Gupta, Svetha Venkatesh, Truyen Tran

IJCAI 2025

Reproducibility Variable — Result — LLM Response
Research Type — Experimental: Our first set of experiments, in which the LLM is directly asked to make decisions, shows that agents considering future consequences exhibit sustainable behaviour and achieve high common rewards for the population. Extensive experiments in complex environments showed that the CFC-Agent can manage a sequence of LLM calls for reasoning and engage in communication to cooperate with others, better resolving the common dilemma. Finally, our analysis showed that considering future consequences not only affects the final decision but also improves the conversations between LLM-based agents toward a better resolution of social dilemmas.
Researcher Affiliation — Academia: Applied Artificial Intelligence Initiative (A2I2), Deakin University.
Pseudocode — No: The paper describes methods and a framework but does not include any clearly labeled pseudocode or algorithm blocks. It provides conceptual diagrams (Figures 1 and 2) to illustrate the agent's structure.
Open Source Code — No: The paper states: "The underlying LLMs are open-source large language models: (1) LLAMA-3.1-70B-it [Dubey et al., 2024]; (2) Qwen-2.5-72B-it [Team, 2024]." This refers to the models the authors used, not to code they released for their own method.
Open Datasets — No: The paper conducts experiments in the "Common Harvest" and "Gov Sim" environments [Piatti et al., 2024], which are game environments rather than datasets. It does not provide concrete access information (e.g., specific links, DOIs, repositories, or formal download citations) for any dataset used or generated during the experiments.
Dataset Splits — No: The paper mentions experimental settings such as a "two-player setting (20 runs)", a "setting with 9 agents", and a "maximum timestep of the game is 200". These are parameters of the simulation environment, not explicit training/validation/test splits of a pre-existing dataset.
Hardware Specification — No: The paper mentions using specific LLMs like "LLAMA-3.1-70B-it" and "Qwen-2.5-72B-it", implying the use of computational hardware, but it does not specify any particular GPU models, CPU types, or other hardware details used for running the experiments.
Software Dependencies — No: The paper states: "The underlying LLMs are open-source large language models: (1) LLAMA-3.1-70B-it [Dubey et al., 2024]; (2) Qwen-2.5-72B-it [Team, 2024]." These are the models being studied, not ancillary software components or libraries with specific version numbers (e.g., Python, PyTorch, CUDA) required to replicate the experimental environment.
Experiment Setup — Yes: In the Common Harvests environment, the agents have a memory of size H = 5, i.e., they can remember the 5 most recent experiences when making decisions, and all agents are augmented with rationale. ... We identified this range for LLAMA-3.1-70B-it (α_CFC^LLAMA ∈ [−0.6, 0.4]) and Qwen-2.5-72B-it (α_CFC^Qwen ∈ [−5.0, 5.0]) when intervening over layers l ∈ [20, 60].
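The bounded memory described in the setup (each agent keeps only its H = 5 most recent experiences) can be sketched with a fixed-size buffer. This is a minimal illustration, not the authors' implementation; the class and method names are hypothetical.

```python
from collections import deque


class ExperienceMemory:
    """Bounded agent memory: retains only the H most recent experiences."""

    def __init__(self, H: int = 5):
        # deque with maxlen automatically evicts the oldest entry
        # once more than H experiences have been added
        self.buffer = deque(maxlen=H)

    def add(self, experience):
        """Record a new experience, discarding the oldest if full."""
        self.buffer.append(experience)

    def recall(self):
        """Return the stored experiences, oldest first."""
        return list(self.buffer)


memory = ExperienceMemory(H=5)
for t in range(8):
    memory.add(f"observation at step {t}")
# After 8 additions only the 5 most recent steps (3..7) remain.
print(memory.recall())
```

With H = 5, an agent that has seen 8 observations retains only those from steps 3 through 7, matching the "5 most recent experiences" constraint quoted above.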