Solving Robust Markov Decision Processes: Generic, Reliable, Efficient
Authors: Tobias Meggendorfer, Maximilian Weininger, Patrick Wienhöft
AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our prototype implementation outperforms existing tools by several orders of magnitude and can solve RMDPs with a million states in under a minute. 6 Experimental Evaluation We implemented a prototype in Java, based on PET (Meggendorfer and Weininger 2024). For linear optimization, we use the (pure Java) library oj! Algorithms. We ran our experiments on a machine with standard hardware (AMD Ryzen 5 CPU, 16GB RAM) running Linux Mint OS and using Open JDK 21 as JRE. Our tool, its source code, all models, and instructions to replicate all results can be found at (Meggendorfer 2024). Experimental Results Table 1 shows that our approach massively outperforms RVI, RRVI, and RRPI by several orders of magnitude. In particular, it seems that the runtime of their approaches grow significantly faster than ours. For example, going from lake-10-M to lake-15-M increases the runtime of RPPI by a factor of about 50, while ours only increases by a factor of 5. We believe this is partly due to implementation inefficiencies, but more importantly due the exponential space requirement of constructing the induced SG explicitly. In particular, just building the game structure for lake-100-U in their implementation takes 200s, and crashes with a memory-out on cont-100. In Table 2, we compare to the approach of PRISM. Our approach has runtimes in the same order of magnitude, with differences likely due to implementation details. However, recall that by design our tool needs to work more, as it provides guarantees. Thus, these results demonstrate that additionally obtaining guarantees via our approach does not drastically increase the runtime and scales to significantly sized models. |
| Researcher Affiliation | Academia | 1Lancaster University Leipzig, Leipzig, Germany 2Institute of Science and Technology Austria, Klosterneuburg, Austria 3Dresden University of Technology, Dresden, Germany |
| Pseudocode | Yes | Algorithm 1 Best-Effort Implicit Value Iteration for RMDP Algorithm 2 Bounded Value Iteration for RMDP |
| Open Source Code | Yes | Code https://zenodo.org/records/14385450 Our tool, its source code, all models, and instructions to replicate all results can be found at (Meggendorfer 2024). |
| Open Datasets | Yes | Finally, we modified several standard models from (Hartmanns et al. 2019) by adding rectangular constraints on selected transitions or adding Lp balls around all transitions. All considered (and further) models are included in the artefact. Our tool, its source code, all models, and instructions to replicate all results can be found at (Meggendorfer 2024). |
| Dataset Splits | No | The paper mentions using |
| Hardware Specification | Yes | We ran our experiments on a machine with standard hardware (AMD Ryzen 5 CPU, 16GB RAM) running Linux Mint OS and using Open JDK 21 as JRE. |
| Software Dependencies | No | We implemented a prototype in Java, based on PET (Meggendorfer and Weininger 2024). For linear optimization, we use the (pure Java) library oj! Algorithms. We ran our experiments on a machine with standard hardware (AMD Ryzen 5 CPU, 16GB RAM) running Linux Mint OS and using Open JDK 21 as JRE. |
| Experiment Setup | Yes | Guarantees and Runtime. We remark that RVI and RRVI as well as PRISM do not give a practical stopping criterion. For the former two, the implementation from (Chatterjee et al. 2024) aids them by stopping once the iterates are sufficiently close to the correct value, while PRISM stops once the iterates do not changes much between steps. Notably, PRISM s heuristic can indeed lead to stopping early and wrongly, see (Haddad and Monmege 2018). In contrast, our approach produces converging lower and upper bounds and thus also provides a correct stopping criterion. Computing both bounds until achieving precision of ε (10 6 in our experiments) naturally requires more effort than just working with one side and stopping at the correct time via an oracle or heuristics: Model Ours RVI RRVI RPPI cont-50 <1 223 <1 30 cont-75 <1 700 <1 70 cont-100 <1 M/O cont-125 1 M/O lake-10-U <1 T/O 7 lake-10-M <1 26 lake-15-U <1 T/O 136 lake-15-M 2 T/O lake-100-U 6 T/O T/O lake-100-M 12 T/O Table 1: Comparison of our approach for maximizing LRA to the approaches implemented in (Chatterjee et al. 2024) on the models used in that paper. We report solving times (excluding model building / parsing) in seconds. A dash indicates the approach does not support a particular model. T/O denotes a timeout of 15 minutes, M/O a memory-out crash. lake-n are their frozenlake models of size n n, the suffix U or M indicates the unichain or multichain variant. cont-n are the contamination models with n states. |