Optimally Solving Simultaneous-Move Dec-POMDPs: The Sequential Central Planning Approach
Authors: Johan Peralez, Aurélien Delage, Jacopo Castellini, Rafael F. Cunha, Jilles S. Dibangoye
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on two- as well as many-agent domains from the literature against ϵ-optimal simultaneous-move solvers confirm the superiority of our novel approach. This section presents the results of our experiments, which compare the sequential planning approach with its simultaneous counterpart as employed in many leading-edge global multi-agent planning and reinforcement learning algorithms, using oSARSA (cf. Algorithm 1) as the standard algorithmic scheme. |
| Researcher Affiliation | Academia | 1. Univ Lyon, INSA Lyon, Inria, CITI, EA3720, 69621 Villeurbanne, France; 2. Haute École de Gestion de Genève, University of Applied Sciences and Arts Western Switzerland, 1227 Carouge, Geneva, Switzerland; 3. Bernoulli Institute, University of Groningen, Nijenborgh 4, NL-9747AG, Groningen, Netherlands. EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: oSARSA(ϵ). 1: Init a_{0:ℓ−1} ← blind policy, α(·) ← 0, g ← −∞. 2: foreach episode do 3: Init SOC s_0 ← b_0 and step τ_0 ← 0. 4: for τ = 0 to ℓ−1 do 5: Select ϵ-greedily a_τ^{s_τ} w.r.t. α_{τ+1}. 6: Compute SOC s_{τ+1} ← T(s_τ, a_τ^{s_τ}). 7: if Accept(a_τ^{s_τ}, g_{τ+1}, α_{τ+1}(s_{τ+1})) then 8: a_{0:ℓ−1} ← ⟨a_{0:τ−1}, a_τ^{s_τ}, a_{τ+1:ℓ−1}⟩. 9: g_{τ+1} ← α_{τ+1}(s_{τ+1}). 11: for τ = τ_0 down to 0 do 12: Update α_τ backward. |
| Open Source Code | Yes | The source code is available at https://git.lwp.rug.nl/ml-rug/osarsa-aaai-25. |
| Open Datasets | Yes | We have comprehensively assessed various algorithms using several two-agent benchmarks from academic literature, available at masplan.org. These benchmarks include mabc, recycling, grid3x3, boxpushing, mars, and tiger. |
| Dataset Splits | No | No specific dataset splits (e.g., train/test/validation percentages or counts) are provided. The paper describes experiments on various multi-agent domains with different planning horizons and uses an episodic reinforcement learning algorithm, which typically involves interaction with an environment rather than fixed dataset splits. |
| Hardware Specification | Yes | All experiments were run on an Ubuntu machine with 16 GB of available RAM and a 1.8 GHz processor, using only one core and a time limit of 1 hour. |
| Software Dependencies | No | The paper mentions using 'ILOG CPLEX Optimization Studio' but does not specify a version number. No other software dependencies are listed with specific version information. |
| Experiment Setup | Yes | Algorithm 1 proceeds by generating trajectories of SOCs instead of OCs, one trajectory per episode, guided by sequential ϵ-greedy exploration steps. A portfolio of heuristic policies guides the selection of the next sequential action to enhance exploration; this portfolio includes the random policy, the underlying MDP policy, and the blind policy (Hauskrecht 2000). An acceptance rule based on Simulated Annealing (SA) (Rutenbar 1989) prevents the algorithm from getting stuck in a local optimum: a modification of the current sequential policy is kept not only when performance improves but also otherwise, with a probability that depends on the loss and on a temperature coefficient that decreases over time. |
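The SA-based acceptance rule in the experiment setup can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact acceptance probability is not given in the excerpt, so the classic Metropolis criterion is assumed, and the function and parameter names (`sa_accept`, `candidate_value`, `incumbent_value`, `temperature`) are hypothetical.

```python
import math
import random


def sa_accept(candidate_value, incumbent_value, temperature, rng=random):
    """Simulated-annealing acceptance rule (sketch).

    Keeps a modification of the current sequential policy if it improves
    performance, and otherwise keeps it with a probability that shrinks
    with the loss and with the temperature (Metropolis criterion assumed).
    """
    loss = incumbent_value - candidate_value  # positive when the candidate is worse
    if loss <= 0:
        return True  # improvement (or tie): always accept
    # Worse candidate: accept with probability exp(-loss / temperature).
    return rng.random() < math.exp(-loss / temperature)
```

In a typical SA schedule, `temperature` starts high (so most modifications are accepted and the search explores broadly) and is decayed across episodes, so that late in training only improving modifications survive.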