reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Counterfactual Effect Decomposition in Multi-Agent Sequential Decision Making

Authors: Stelios Triantafyllou, Aleksa Sukovic, Yasaman Zolfimoselo, Goran Radanovic

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through extensive experimentation, we demonstrate the interpretability of our approach in a Gridworld environment with LLM-assisted agents and a sepsis management simulator. We experimentally validate the interpretability of our approach using two multi-agent environments: a grid-world environment, where two RL actors are instructed by an LLM planner to complete a sequence of tasks, and the sepsis management simulator from Fig. 1.
Researcher Affiliation	Academia	Max Planck Institute for Software Systems, Germany. Correspondence to: Stelios Triantafyllou <EMAIL>.
Pseudocode	Yes	Appendix G includes an algorithm for the approximation of the expected conditional variance of $Y_{I,a_{i,t}}$. Algorithm 1: Estimates $\mathbb{E}[\mathrm{Var}(Y_{I,a_{i,t}} \mid \tau, U^{<S_k}) \mid \tau]_M$
Open Source Code	Yes	Code to reproduce our experiments is available at https://github.com/stelios30/cf-effect-decomposition.git.
Open Datasets	No	We experimentally validate the interpretability of our approach using two multi-agent environments: a grid-world environment, where two RL actors are instructed by an LLM planner to complete a sequence of tasks, and the sepsis management simulator from Fig. 1. Our experimental setup and implementation closely follow that of (Triantafyllou et al., 2024).
Dataset Splits	No	Throughout both experiments, we use 100 posterior samples for estimating counterfactual effects and 20 additional ones for the conditional variance. We generate 600 trajectories with unsuccessful outcomes.
Hardware Specification	Yes	All experiments were run on a 64-bit Debian-based machine having 2x12 CPU cores clocked at 3 GHz with access to 1 TB of DDR3 1600 MHz RAM and an NVIDIA A40 GPU.
Software Dependencies	Yes	The software stack relied on Python 3.9.13, with installed standard scientific packages for numeric calculations and visualization (we provide a full list of dependencies and their exact versions as part of our code).
Experiment Setup	Yes	We provide a full list of hyperparameters in Table 3. Table 3: Hyperparameters used for the Gridworld actors' policies. Discount: 0.99; Target Update Freq.: 1000; Batch size: 512; Learning Rate: 1e-4.
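
The Pseudocode and Dataset Splits rows mention an estimate of an expected conditional variance built from 100 posterior samples plus 20 additional ones. As a rough illustration of that kind of quantity, the sketch below uses a generic nested Monte Carlo scheme: draw the conditioned-on noise once per outer sample, resample everything else in an inner loop, and average the inner variances. It is not the paper's Algorithm 1. The function name, the two sampler callbacks, the toy Gaussian model, and the mapping of 100 and 20 to outer and inner loop sizes are all assumptions made for illustration. The samplers are also assumed to already draw from the posterior given the observed trajectory.

```python
import random
import statistics
from typing import Callable


def expected_conditional_variance(
    sample_outer: Callable[[random.Random], float],
    sample_outcome: Callable[[float, random.Random], float],
    n_outer: int = 100,  # assumed to correspond to the 100 posterior samples
    n_inner: int = 20,   # assumed to correspond to the 20 additional samples
    seed: int = 0,
) -> float:
    """Nested Monte Carlo estimate of E[Var(Y | U_outer)] (illustrative only)."""
    rng = random.Random(seed)
    inner_variances = []
    for _ in range(n_outer):
        # Draw the noise that is conditioned on, once per outer sample.
        u = sample_outer(rng)
        # Resample the remaining noise while holding u fixed.
        ys = [sample_outcome(u, rng) for _ in range(n_inner)]
        # Unbiased sample variance of the outcome given u.
        inner_variances.append(statistics.variance(ys))
    # Average the conditional variances over the outer samples.
    return statistics.fmean(inner_variances)


# Hypothetical toy model: Y = u + e, where u ~ N(0, 1) is conditioned on
# and e ~ N(0, 2^2) is resampled, so the true value of E[Var(Y | u)] is 4.
def toy_outer(rng: random.Random) -> float:
    return rng.gauss(0.0, 1.0)


def toy_outcome(u: float, rng: random.Random) -> float:
    return u + rng.gauss(0.0, 2.0)


estimate = expected_conditional_variance(toy_outer, toy_outcome)
print(f"estimated expected conditional variance: {estimate:.3f}")
```

With these sample sizes the estimate should land near the toy model's true value of 4 but will not match it exactly. Increasing `n_outer` reduces the spread more cheaply than increasing `n_inner`, since each inner variance is already unbiased.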