Model-Based Exploration in Monitored Markov Decision Processes
Authors: Alireza Kazemipour, Matthew E. Taylor, Michael Bowling
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we introduce a model-based algorithm for Mon-MDPs that addresses these shortcomings. The algorithm employs two instances of model-based interval estimation: one to ensure that observable rewards are reliably captured, and another to learn a minimax-optimal policy. Second, we empirically demonstrate its advantages: faster convergence than prior algorithms across more than four dozen benchmarks, and even more dramatic improvements when the monitoring process is known. Third, we present the first finite-sample bound on performance, showing convergence to a minimax-optimal policy even when some rewards are never observable. |
| Researcher Affiliation | Academia | ¹Department of Computing Science, University of Alberta; ²Alberta Machine Intelligence Institute (Amii); ³Canada CIFAR AI Chair, Amii. Correspondence to: Alireza Kazemipour <EMAIL>. |
| Pseudocode | Yes | D. Pseudocode Algorithm 1 Monitored MBIE-EB... Algorithm 2 Directed-E2 |
| Open Source Code | Yes | Code: https://github.com/IRLL/Exploration-in-Mon-MDPs. |
| Open Datasets | No | Figure 7: Full set of environments. Except Bottleneck, all environments are borrowed from Parisi et al. (2024a). Snakes and Wasps should be avoided. The goal is to STAY at the treasure chest. Gold bars are distractors. The agent gets stuck in the holes unless it is randomly pulled out. One-ways transition the agent in their own direction regardless of the action. ... The paper describes various environments (e.g., River Swim, Bottleneck) and refers to benchmarks from prior work (Parisi et al., 2024a). While these environments are used for empirical evaluation, the paper does not provide concrete access information (e.g., direct links or DOIs) for downloading them as explicit datasets. |
| Dataset Splits | No | Evaluation is based on discounted test return, averaged over 30 seeds with 95% confidence intervals. Every 100 steps, learning is paused, the agent is tested for 100 episodes, and the mean return is recorded. ... The paper describes an evaluation protocol using 30 seeds and 100 test episodes, but it does not specify dataset splits (e.g., train/validation/test percentages or counts) in the sense of traditional supervised learning. As this is a reinforcement learning paper, the concept of fixed dataset splits is not directly applicable in the same way. |
| Hardware Specification | Yes | We ran all the experiments on a SLURM-based cluster, using 32 Intel E5-2683 v4 Broadwell @ 2.1GHz CPUs. Thirty parallel runs took about an hour on a 32-core CPU. |
| Software Dependencies | No | The paper mentions computing KL-UCB using 'Newton's method' but does not specify any programming languages, libraries, or solvers with version numbers. |
| Experiment Setup | Yes | In practice, we slowly increase the confidence levels used in the bonuses over time. We follow the pattern of Lattimore & Szepesvári (2020), with the confidence growing slightly faster than logarithmically. The scale parameters β, βm, βe, βobs, βKL-UCB were tuned manually for each environment. We also grow κ slowly over time, allowing the agent to interleave observe and optimize episodes: κ(k) = log k, where the log base was also tuned manually for each environment. Finally, rather than exactly solving the models every episode, we maintain two action-values, Q̃_obs and Q̃_opt, both initialized optimistically. We do 50 steps of synchronous value iteration before every episode to improve Q̃_opt, and before every observe episode to improve Q̃_obs. ... Table 1: The set of hyperparameters. |
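The setup quoted in the last row (an optimistically initialized action-value table, an MBIE-EB-style count-based bonus, and 50 steps of synchronous value iteration per episode) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the clipping of zero visit counts to 1, and the fixed-horizon iteration count are all choices made here for the sketch.

```python
import numpy as np

def mbie_eb_value_iteration(P_hat, R_hat, counts, beta, gamma=0.95, n_steps=50):
    """Synchronous value iteration on an empirical model with an
    MBIE-EB-style exploration bonus beta / sqrt(n(s, a)) added to the
    empirical mean reward.

    P_hat:  (S, A, S) empirical transition probabilities
    R_hat:  (S, A)    empirical mean rewards
    counts: (S, A)    visit counts (zero counts are clipped to 1 here)
    """
    S, A = R_hat.shape
    # Optimistic initialization at the value ceiling 1 / (1 - gamma).
    Q = np.full((S, A), 1.0 / (1.0 - gamma))
    bonus = beta / np.sqrt(np.maximum(counts, 1.0))
    for _ in range(n_steps):
        V = Q.max(axis=1)                      # greedy backup values
        Q = R_hat + bonus + gamma * (P_hat @ V)
    return Q
```

Under this reading, the paper's two tables Q̃_obs and Q̃_opt would each be refined by a call of this shape with their own bonus scales (e.g. βobs vs. β); the schedule that slowly grows the confidence levels over time is omitted from the sketch.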