Revelations: A Decidable Class of POMDPs with Omega-Regular Objectives
Authors: Marius Belly, Nathanaël Fijalkow, Hugo Gimbert, Florian Horn, Guillermo A. Pérez, Pierre Vandenhove
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide a comparison between our algorithm and off-the-shelf deep reinforcement learning (DRL) trained via an observation wrapper. As we will show in the paper, the MDP induced by the belief supports carries sufficient information to play in revealing POMDPs; hence, we used a wrapper implementing a subset construction on the fly to generate the current belief support, and focused on algorithms intended for MDPs. Spending moderate effort on reward engineering and hyperparameter tuning, we have been unable to match the performance of our algorithm (see Figure 3). |
| Researcher Affiliation | Academia | CNRS, LaBRI, Université de Bordeaux, France; CNRS, IRIF, Université de Paris, France; University of Antwerp and Flanders Make, Antwerp, Belgium |
| Pseudocode | No | The paper describes theoretical results and algorithms, but it does not contain a clearly labeled pseudocode or algorithm block with structured steps. |
| Open Source Code | Yes | Code https://github.com/gaperez64/pomdps-reveal |
| Open Datasets | Yes | Here, we depict this value, per step (from 1 to 500) over 500 simulations of a revealing version of the classical tiger POMDP (Cassandra, Kaelbling, and Littman 1994). The example used will be discussed in Section 5, Example 2. ... We give an example of a strongly revealing POMDP inspired from the tiger of (Cassandra, Kaelbling, and Littman 1994), depicted in Figure 5. This example was used in Figure 3 in the introduction; the code to generate it in our tool is provided in (Belly et al. 2024, Appendix A). |
| Dataset Splits | No | The paper discusses simulations and evaluation of strategies in POMDPs, which do not typically involve dataset splits like those in supervised learning. It mentions "500 simulations" but not a training/test/validation split of a dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running experiments. |
| Software Dependencies | No | A2C, DQN, and PPO are (MlpPolicy) strategies obtained from the stable-baselines library (Raffin et al. 2021). The specific version number of stable-baselines is not provided. |
| Experiment Setup | No | A2C, DQN, and PPO are (MlpPolicy) strategies obtained from the stable-baselines library (Raffin et al. 2021), trained (for a total of 10k time steps) with default parameter values using a simple reward scheme: a good event yields a reward of 100; a bad one, 1. While the training duration and reward scheme are mentioned, the specific 'default parameter values' for the DRL algorithms are not explicitly stated. |
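The observation wrapper quoted under "Research Type" performs a subset construction on the fly: the agent tracks the set of states the play might currently be in (the belief support), updating it after each action and observation. A minimal sketch of that update, with illustrative names (`belief_support_update`, `transitions`, `obs_of`) that are assumptions and not taken from the authors' `pomdps-reveal` tool:

```python
def belief_support_update(support, action, observation, transitions, obs_of):
    """Successor belief support after playing `action` and receiving `observation`.

    support      frozenset of states the play may currently be in
    transitions  dict mapping (state, action) to the set of possible successors
    obs_of       dict mapping each state to the observation it emits
    """
    # Collect all states reachable in one step from the current support.
    successors = set()
    for s in support:
        successors |= transitions.get((s, action), set())
    # Discard successors inconsistent with the observation actually received.
    return frozenset(t for t in successors if obs_of[t] == observation)


# Toy example: from state 0, action 'a' may lead to 0 or 1, which emit
# distinct observations, so observing 'y' reveals that the state is 1.
transitions = {(0, "a"): {0, 1}, (1, "a"): {1}}
obs_of = {0: "x", 1: "y"}
print(belief_support_update(frozenset({0}), "a", "y", transitions, obs_of))
```

In a revealing POMDP the true state is revealed infinitely often, so this support repeatedly collapses to a singleton; the paper's claim that the belief-support MDP carries enough information to play is what justifies feeding this set (rather than a full belief distribution) to MDP-oriented DRL algorithms.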