An Oracle-Guided Approach to Constrained Policy Synthesis Under Uncertainty

Authors: Roman Andriushchenko, Milan Češka, Filip Macák, Sebastian Junges, Joost-Pieter Katoen

JAIR 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental
This section presents an experimental evaluation of our approach. It mainly focuses on three questions: Q1: Is our approach able to solve interesting problems from different planning and decision-making domains? To answer this, we consider three case studies: i) synthesising a power manager from a rough sketch, ii) constrained planning in a maze POMDP, and iii) solving a meeting-grid Dec-POMDP. We discuss and interpret the resulting policies.
Researcher Affiliation: Academia
Roman Andriushchenko (EMAIL), Milan Češka (EMAIL), Filip Macák (EMAIL): Brno University of Technology, Brno, Czech Republic; Sebastian Junges (EMAIL): Radboud University, Nijmegen, The Netherlands; Joost-Pieter Katoen (EMAIL): RWTH Aachen University, Aachen, Germany
Pseudocode: Yes
Algorithm 1 summarises the key steps of the hybrid synthesis algorithm for the maximal reach-probability problem (see Section 2.3). We start (l.1) by creating the sub-space stack (initially containing only the entire family) and construct the formula φ describing each unexplored parameter assignment. In each iteration of the main loop, we explore exactly one sub-space. We first run one iteration of AR (l.4), which may update the current optimum δ. This call also yields reachability-probability bounds for the sub-space. If this one iteration proves that the sub-space does not contain a better solution than the (updated) δ, we continue with the next sub-space. Otherwise, we set the time limit for CEG (l.6) based on the previous performance of AR and CEG. When exploring the sub-space, CEG uses the obtained bounds to construct counterexamples modulo bounds. CEG terminates either by exploring all assignments of the sub-space or by reaching the allocated time limit. In the latter case (l.9), we split the sub-space into sub-families to be analysed in subsequent iterations. To balance AR vs. CEG reasoning, we track the numbers η_AR and η_CEG of assignments pruned per time unit by AR and CEG, respectively. Let τ_AR and τ_CEG denote the current total run-times of AR and CEG; then, on l.6 we allocate τ_AR · η_CEG / η_AR − τ_CEG time units to CEG to explore the sub-space. The idea is to let AR and CEG run for equal portions of time when both are equally successful, and to allocate more time to CEG when it outpaces AR in terms of the number of pruned assignments, and vice versa.
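The balancing rule can be sketched in a few lines of Python. The function name is hypothetical and the allocation formula τ_AR · η_CEG / η_AR − τ_CEG is reconstructed from the prose description above (equal pruning rates should yield equal total run-times), not taken from PAYNT's actual code:

```python
def ceg_time_budget(tau_ar: float, tau_ceg: float,
                    eta_ar: float, eta_ceg: float) -> float:
    """Time units to allocate to CEG for the next sub-space.

    tau_ar, tau_ceg: total run-time spent so far by AR and CEG;
    eta_ar, eta_ceg: assignments pruned per time unit by each engine.
    When both engines prune equally fast (eta_ar == eta_ceg), the
    budget equals tau_ar - tau_ceg, i.e. it steers the two engines
    towards equal total run-time; a relatively faster CEG receives
    proportionally more time, and vice versa.
    """
    return tau_ar * eta_ceg / eta_ar - tau_ceg

# Equal pruning rates: CEG is topped up to match AR's total run-time.
print(ceg_time_budget(tau_ar=10.0, tau_ceg=4.0, eta_ar=2.0, eta_ceg=2.0))  # 6.0
# CEG prunes twice as fast as AR: it receives a larger share.
print(ceg_time_budget(tau_ar=10.0, tau_ceg=4.0, eta_ar=1.0, eta_ceg=2.0))  # 16.0
```

A negative budget would simply mean CEG has already consumed its fair share, in which case the next iteration falls back to AR alone.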
Open Source Code: Yes
Availability: The version of PAYNT used for the evaluation, as well as all benchmarks, are publicly available (Andriushchenko, Češka, Junges, Katoen, & Macák, 2024).
Open Datasets: Yes
We evaluate our approach on traditional CPOMDP benchmarks: single-objective POMDPs augmented with costs for low-reward actions, taken from (Poupart et al., 2015). To showcase that our approach is competitive for Dec-POMDP solving, we compare PAYNT with Inf-JESP on traditional Dec-POMDP benchmarks taken from http://masplan.org/problem_domains (Recycling, Grid3x3, Dec-Tiger, Box Pushing), as well as on the already mentioned meeting-grid Dec-POMDP.
Dataset Splits: No
The paper does not explicitly provide training/test/validation dataset splits or cross-validation details. It refers to 'traditional CPOMDP benchmarks' and 'traditional Dec-POMDP benchmarks' and states that 'The timeout for each experiment was 2 hours. We report the values for the policy that achieved the best average discounted reward (ADR) within this timeout.' without further specifying how the data within these benchmarks is split or used.
Hardware Specification: Yes
Hardware: All experiments are run on a single core of a machine equipped with an Intel i5-12600KF @ 4.9 GHz CPU with 64 GB RAM.
Software Dependencies: No
The paper mentions specific tools such as Z3 and CVC5 in Section 5 ('SMT solver (we use Z3 (de Moura & Bjørner, 2008) or CVC5 (Barbosa et al., 2022))') and PRISM in Section 6 and elsewhere ('PRISM guarded-command language (Kwiatkowska et al., 2011)'). However, it does not provide version numbers for these software components. The only version mentioned is 'PRISM 4.0' in the bibliography, not in the context of the version used for the experiments.
Experiment Setup: Yes
The objective of the specification Φ is to minimise the power consumption during the operation time while ensuring that the expected number of lost requests is at most one. The PM cannot fully observe the request queue. Instead, it can only distinguish four occupancy levels determined by the threshold parameters T1 ∈ {0, 0.1, 0.2, 0.3, 0.4}, T3 ∈ {0.6, 0.7, 0.8, 0.9}, and a fixed threshold T2 = 0.5. As the power consumption depends on the queue size, we aim to obtain the best power mode for each occupancy level. This is represented by the parameter Pi (for occupancy level i) whose domain is {sleep, idle, active}. The maximal capacity QMAX ranges over {1, 2, …, 10}. ... Within about 3 minutes, our approach finds the following optimal solution: QMAX = 6, T1 = 0.1, T3 = 0.9, P1 = idle, P2 = P3 = P4 = active. For fixed k, the goal ϕ0 is to find the k-FSC (finite-state deterministic controller with k memory nodes; see the formal definition in App. B) that maximises the reachability probability to G. Additionally, we impose two antagonistic constraints: ϕ1 := the expected number of visits to the yellow cells with coins (denoted Ey) is at least 10, and ϕ2 := the expected number of visits to the "danger" red cells (denoted Er) is below 7. The timeout for the evaluation of the engines was 30 minutes. On the model Pole, there is a maximising expected-reward specification which is not supported by the CEG engine (and therefore by the Hybrid engine), so their times are missing. The timeout for the previous implementation of PAYNT was set to 2 hours.
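The parameter domains of the power-manager sketch determine a finite family of candidate designs. The enumeration below is an illustrative back-of-the-envelope count, assuming every combination of the listed domains yields one distinct candidate (the synthesiser prunes this space symbolically rather than enumerating it); variable names mirror the sketch parameters:

```python
from itertools import product

# Parameter domains of the power-manager sketch as listed above.
T1_DOMAIN = [0, 0.1, 0.2, 0.3, 0.4]
T3_DOMAIN = [0.6, 0.7, 0.8, 0.9]
POWER_MODES = ["sleep", "idle", "active"]   # domain of each P_i
QMAX_DOMAIN = range(1, 11)                  # {1, 2, ..., 10}

# One power-mode choice per occupancy level P1..P4.
family = product(T1_DOMAIN, T3_DOMAIN, QMAX_DOMAIN,
                 POWER_MODES, POWER_MODES, POWER_MODES, POWER_MODES)
print(sum(1 for _ in family))  # 5 * 4 * 10 * 3**4 = 16200 candidates
```

Even this modest sketch thus spans over sixteen thousand designs, which is why exhaustive enumeration is replaced by the AR/CEG reasoning described earlier.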