Robust Finite-Memory Policy Gradients for Hidden-Model POMDPs

Authors: Maris F. L. Galesloot, Roman Andriushchenko, Milan Ceska, Sebastian Junges, Nils Jansen

IJCAI 2025

Reproducibility assessment (Variable | Result | LLM Response):
Research Type | Experimental | The empirical evaluation shows that, compared to various baselines, our approach (1) produces policies that are more robust and generalize better to unseen POMDPs, and (2) scales to HM-POMDPs that consist of over a hundred thousand environments. ... 5 Experimental Evaluation In this section, we evaluate RFPG on the following questions. (Q1) Does RFPG produce policies with higher robust performance compared to several baselines? (Q2) Can RFPG generalize to unseen environments? (Q3) How does the POMDP selection affect performance?
Researcher Affiliation | Academia | 1Radboud University Nijmegen, The Netherlands; 2Brno University of Technology, Czechia; 3Ruhr-University Bochum, Germany; EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1: The RFPG algorithm
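The report only names "Algorithm 1: The RFPG algorithm" without reproducing it. As a rough, unofficial sketch consistent with the paper's emphasis on POMDP selection (Q3), one plausible robust policy-gradient step evaluates the current policy on the candidate POMDPs and ascends the gradient on the worst-case instance. All names here (`estimate_value`, `policy_gradient`) are hypothetical placeholders, not the paper's actual API:

```python
def rfpg_step(policy_params, pomdps, estimate_value, policy_gradient, lr=0.1):
    """One hypothetical robust policy-gradient step: score each candidate
    POMDP under the current policy, pick the worst-case (lowest-value)
    instance, and take a gradient-ascent step on that instance."""
    values = [estimate_value(policy_params, m) for m in pomdps]
    worst = min(range(len(pomdps)), key=values.__getitem__)
    grad = policy_gradient(policy_params, pomdps[worst])
    new_params = [p + lr * g for p, g in zip(policy_params, grad)]
    return new_params, worst
```

This is only a sketch of the worst-case-selection idea; the actual RFPG algorithm (finite-memory parameterization, gradient estimator, selection rule) is given in the paper itself.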
Open Source Code | Yes | Code is on Zenodo (https://doi.org/10.5281/zenodo.15479642) and the paper with appendix is on arXiv [Galesloot et al., 2025].
Open Datasets | Yes | We extend four POMDP benchmarks [Littman et al., 1997; Norman et al., 2017; Qiu et al., 1999] and one family of MDPs [Andriushchenko et al., 2024] to HM-POMDPs. These benchmarks together encompass a varied selection of different complexities of HM-POMDPs, i.e., different numbers of POMDPs and sizes thereof, as reported in Table 1. Appendix C gives a detailed description of the benchmarks.
Dataset Splits | Yes | (1) Pick a random subset of ten POMDPs from the full HM-POMDP, (2) compute a robust policy for this smaller sub-HM-POMDP using the four baselines and RFPG (referred to as RFPG-S), (3) compare the achieved robust performance of RFPG to the baselines on this sub-HM-POMDP (Q1). ... (5) compare the robust performance of the resulting six policies on the full HM-POMDP using the policy evaluation method from Section 4.3. From this experiment, we can not only assess the scalability of our approach compared to the baselines but, moreover, the ability to generalize to unseen environments (Q2). Additionally, we can see if RFPG produces a better robust performance than RFPG-S, indicating whether it is essential to assess all POMDPs within an HM-POMDP. ... To report statistically significant results, each experiment was carried out on 10 different subsets obtained using stratified sampling from the full HM-POMDP.
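The quoted protocol draws 10 different ten-POMDP subsets via stratified sampling from the full HM-POMDP. The paper excerpt does not state the stratification criterion, so the `strata` mapping below is a hypothetical grouping; the sketch samples proportionally from each stratum and tops up to the target subset size:

```python
import random

def stratified_subsets(pomdp_ids, strata, subset_size=10, n_subsets=10, seed=0):
    """Draw `n_subsets` subsets of `subset_size` POMDP ids, sampling from
    each stratum in proportion to its size (stratification criterion is a
    hypothetical stand-in; the paper does not specify it)."""
    rng = random.Random(seed)
    by_stratum = {}
    for pid in pomdp_ids:
        by_stratum.setdefault(strata[pid], []).append(pid)
    subsets = []
    for _ in range(n_subsets):
        picked = []
        for members in by_stratum.values():
            # share of the subset proportional to the stratum's size
            k = max(1, round(subset_size * len(members) / len(pomdp_ids)))
            picked.extend(rng.sample(members, min(k, len(members))))
        # top up if proportional rounding undershot the target size
        remaining = [p for p in pomdp_ids if p not in picked]
        while len(picked) < subset_size and remaining:
            picked.append(remaining.pop(rng.randrange(len(remaining))))
        subsets.append(picked[:subset_size])
    return subsets
```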
Hardware Specification | No | The paper mentions "Appendix D provides information on the infrastructure used to run the experiments." However, the provided text does not contain specific hardware details such as GPU/CPU models or memory specifications.
Software Dependencies | No | The paper refers to tools like PAYNT [Andriushchenko et al., 2021] and SAYNT [Andriushchenko et al., 2023] but does not specify their version numbers or other software dependencies with versions.
Experiment Setup | Yes | GASTEPS is a hyperparameter that should be tuned based on the size of the HM-POMDP: having many instances |I| slows down the policy evaluation, while many states |S| slows down the gradient update steps. In our experiments, we picked GASTEPS = 10, such that at most 75% of the computation time is spent on policy evaluation. ... All methods have a one-hour timeout to compute a policy; in case of a timeout, we report the robust performance of a uniform random policy.
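The quoted setup alternates GASTEPS gradient-ascent steps with a full robust policy evaluation, choosing GASTEPS = 10 so that evaluation consumes at most 75% of the computation time, under a one-hour timeout. A minimal sketch of that alternation, with hypothetical `gradient_step`/`evaluate_robust` callbacks, which also measures the evaluation-time fraction:

```python
import time

def run_with_budget(gradient_step, evaluate_robust, policy,
                    gasteps=10, timeout_s=3600.0):
    """Alternate `gasteps` gradient updates with one robust evaluation
    until the timeout, tracking the wall-clock fraction spent evaluating.
    The callbacks are hypothetical stand-ins, not the paper's API."""
    t_grad = t_eval = 0.0
    start = time.monotonic()
    best_value, best_policy = float("-inf"), policy
    while time.monotonic() - start < timeout_s:
        t0 = time.monotonic()
        for _ in range(gasteps):
            policy = gradient_step(policy)
        t_grad += time.monotonic() - t0

        t0 = time.monotonic()
        value = evaluate_robust(policy)  # robust value across all instances
        t_eval += time.monotonic() - t0

        if value > best_value:
            best_value, best_policy = value, policy
    total = t_grad + t_eval
    eval_fraction = t_eval / total if total else 0.0
    return best_value, best_policy, eval_fraction
```

In the paper's terms, a larger instance count |I| makes `evaluate_robust` slower (pushing `eval_fraction` up), while a larger state space |S| makes `gradient_step` slower, which is why GASTEPS has to be tuned per HM-POMDP.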