Robust Finite-Memory Policy Gradients for Hidden-Model POMDPs
Authors: Maris F. L. Galesloot, Roman Andriushchenko, Milan Ceska, Sebastian Junges, Nils Jansen
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The empirical evaluation shows that, compared to various baselines, our approach (1) produces policies that are more robust and generalize better to unseen POMDPs, and (2) scales to HM-POMDPs that consist of over a hundred thousand environments. ... 5 Experimental Evaluation In this section, we evaluate RFPG on the following questions. (Q1) Does RFPG produce policies with higher robust performance compared to several baselines? (Q2) Can RFPG generalize to unseen environments? (Q3) How does the POMDP selection affect performance? |
| Researcher Affiliation | Academia | 1Radboud University Nijmegen, The Netherlands 2Brno University of Technology, Czechia 3Ruhr-University Bochum, Germany EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: The RFPG algorithm |
| Open Source Code | Yes | Code is on Zenodo (https://doi.org/10.5281/zenodo.15479642) and the paper with appendix is on arXiv [Galesloot et al., 2025]. |
| Open Datasets | Yes | We extend four POMDP benchmarks [Littman et al., 1997; Norman et al., 2017; Qiu et al., 1999] and one family of MDPs [Andriushchenko et al., 2024] to HM-POMDPs. These benchmarks together encompass a varied selection of different complexities of HM-POMDPs, i.e., different numbers of POMDPs and sizes thereof, as reported in Table 1. Appendix C gives a detailed description of the benchmarks. |
| Dataset Splits | Yes | (1) Pick a random subset of ten POMDPs from the full HM-POMDP, (2) compute a robust policy for this smaller sub-HM-POMDP using the four baselines and RFPG (referred to as RFPG-S), (3) compare the achieved robust performance of RFPG to the baselines on this sub-HM-POMDP (Q1). ... (5) compare the robust performance of the resulting six policies on the full HM-POMDP using the policy evaluation method from Section 4.3. From this experiment, we can not only assess the scalability of our approach compared to the baselines but, moreover, the ability to generalize to unseen environments (Q2). Additionally, we can see if RFPG produces a better robust performance than RFPG-S, indicating whether it is essential to assess all POMDPs within an HM-POMDP. ... To report statistically significant results, each experiment was carried out on 10 different subsets obtained using stratified sampling from the full HM-POMDP. |
| Hardware Specification | No | The paper mentions "Appendix D provides information on the infrastructure used to run the experiments." However, the provided text does not contain specific hardware details like GPU/CPU models or memory specifications. |
| Software Dependencies | No | The paper refers to tools like PAYNT [Andriushchenko et al., 2021] and SAYNT [Andriushchenko et al., 2023] but does not specify their version numbers or other software dependencies with versions. |
| Experiment Setup | Yes | GASTEPS is a hyperparameter that should be tuned based on the size of the HM-POMDP: having many instances |I| slows down the policy evaluation, while many states |S| slows down the gradient update steps. In our experiments, we picked GASTEPS = 10, such that at most 75% of the computation time is spent on policy evaluation. ... All methods have a one-hour timeout to compute a policy; in case of a timeout, we report the robust performance of a uniform random policy. |
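The evaluation protocol quoted above (sample a sub-HM-POMDP of ten POMDPs, compute a policy, score its worst-case value across environments, and fall back to a uniform random policy on timeout) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helpers `robust_value`, `sample_sub_hm_pomdp`, and `evaluate_with_timeout`, and the treatment of POMDPs and policies as opaque values, are all assumptions made for the sketch.

```python
import random

def robust_value(policy, pomdps, evaluate):
    """Robust performance of a policy: its worst-case value over the POMDPs.

    `evaluate(policy, pomdp)` is a hypothetical stand-in for the paper's
    policy evaluation method (Section 4.3).
    """
    return min(evaluate(policy, p) for p in pomdps)

def sample_sub_hm_pomdp(all_pomdps, k=10, seed=0):
    """Step (1): pick a random subset of k POMDPs from the full HM-POMDP.

    The paper uses stratified sampling over 10 subsets; plain uniform
    sampling is used here for brevity.
    """
    rng = random.Random(seed)
    return rng.sample(all_pomdps, k)

def evaluate_with_timeout(compute_policy, sub_pomdps, evaluate,
                          uniform_random_policy, timed_out):
    """Steps (2)-(3) with the paper's timeout rule: if the method hits the
    one-hour budget, report the robust performance of a uniform random
    policy instead of the computed one.
    """
    if timed_out:
        return robust_value(uniform_random_policy, sub_pomdps, evaluate)
    policy = compute_policy(sub_pomdps)
    return robust_value(policy, sub_pomdps, evaluate)
```

Note the design point this makes explicit: "robust performance" is a `min` over environments, so a single badly-handled POMDP in the subset determines the reported score, which is why generalization to the full HM-POMDP (Q2) is evaluated separately from the sub-HM-POMDP comparison (Q1).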