Revelations: A Decidable Class of POMDPs with Omega-Regular Objectives

Authors: Marius Belly, Nathanaël Fijalkow, Hugo Gimbert, Florian Horn, Guillermo A. Pérez, Pierre Vandenhove

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide a comparison between our algorithm and off-the-shelf deep reinforcement learning (DRL) trained via an observation wrapper. As we show in the paper, the MDP induced by the belief supports carries sufficient information to play in revealing POMDPs; hence, we used a wrapper implementing a subset construction on the fly to generate the current belief support, and focused on algorithms intended for MDPs. Despite moderate effort on reward engineering and hyperparameter tuning, we were unable to match the performance of our algorithm (see Figure 3).
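The observation wrapper described above maintains the current belief support via an on-the-fly subset construction. A minimal sketch of that update step, assuming an illustrative encoding of the POMDP's positive-probability transitions (the data types and names here are my own, not the authors' code):

```python
# Hypothetical sketch: one step of on-the-fly belief-support tracking for a
# POMDP. Only the supports matter, so exact probabilities are not needed:
# we record, per (state, action), the set of (next_state, observation) pairs
# that occur with positive probability.
from typing import Dict, FrozenSet, Tuple

Trans = Dict[Tuple[str, str], FrozenSet[Tuple[str, str]]]

def update_support(support: FrozenSet[str], action: str, obs: str,
                   trans: Trans) -> FrozenSet[str]:
    """Subset construction: keep every state reachable from some state in
    the current support via `action` while emitting observation `obs`."""
    return frozenset(
        s2
        for s in support
        for (s2, o) in trans.get((s, action), frozenset())
        if o == obs
    )
```

In a tiger-like revealing POMDP, a revealing observation collapses the support to a singleton, which is why the induced belief-support MDP suffices for the DRL baselines.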
Researcher Affiliation | Academia | (1) CNRS, LaBRI, Université de Bordeaux, France; (2) CNRS, IRIF, Université de Paris, France; (3) University of Antwerp - Flanders Make, Antwerp, Belgium
Pseudocode | No | The paper describes theoretical results and algorithms, but it does not contain a clearly labeled pseudocode or algorithm block with structured steps.
Open Source Code | Yes | Code: https://github.com/gaperez64/pomdps-reveal
Open Datasets | Yes | Here, we depict this value, per step (from 1 to 500), over 500 simulations of a revealing version of the classical tiger POMDP (Cassandra, Kaelbling, and Littman 1994). The example used will be discussed in Section 5, Example 2. ... We give an example of a strongly revealing POMDP inspired by the tiger POMDP of Cassandra, Kaelbling, and Littman (1994), depicted in Figure 5. This example was used in Figure 3 in the introduction; the code to generate it in our tool is provided in (Belly et al. 2024, Appendix A).
Dataset Splits | No | The paper discusses simulations and evaluation of strategies in POMDPs, which do not typically involve dataset splits like those in supervised learning. It mentions "500 simulations" but not a training/validation/test split of a dataset.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | A2C, DQN, and PPO are (MlpPolicy) strategies obtained from the stable-baselines library (Raffin et al. 2021). The specific version number of stable-baselines is not provided.
Experiment Setup | No | A2C, DQN, and PPO are (MlpPolicy) strategies obtained from the stable-baselines library (Raffin et al. 2021), trained (for a total of 10k time steps) with default parameter values using a simple reward scheme: a good event yields a reward of 100; a bad one, 1. While the training duration and reward scheme are mentioned, the specific 'default parameter values' of the DRL algorithms are not explicitly stated.
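The training setup quoted above can be sketched as follows. Only the library calls, the 10k-step budget, the MlpPolicy choice, and the good/bad reward values come from the text; the environment argument stands in for the authors' belief-support observation wrapper, so this is an assumption-laden sketch, not their script.

```python
def train_baselines(env, total_timesteps=10_000):
    """Train the three DRL baselines with default hyperparameters,
    as described in the row above. `env` is assumed to be a Gym-style
    environment wrapping the POMDP with its belief-support observations."""
    # Imported lazily so the dependency-free reward helper below can be
    # used without stable-baselines3 installed.
    from stable_baselines3 import A2C, DQN, PPO

    models = {}
    for algo in (A2C, DQN, PPO):
        model = algo("MlpPolicy", env, verbose=0)   # default parameter values
        model.learn(total_timesteps=total_timesteps)  # "10k time steps"
        models[algo.__name__] = model
    return models

def event_reward(good_event: bool) -> float:
    """Reward scheme quoted above: a good event yields 100; a bad one, 1."""
    return 100.0 if good_event else 1.0
```

Note that with this scheme both events give positive reward, which is one plausible reason the quoted reward engineering effort mattered.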