Learning and Planning for Time-Varying MDPs Using Maximum Likelihood Estimation

Authors: Melkior Ornik, Ufuk Topcu

JMLR 2021

Reproducibility assessment (variable — result, with the LLM's supporting response quoted or summarized for each):

Research Type — Experimental. "We demonstrate the proposed methods on four numerical examples: a patrolling task with a change in system dynamics, a two-state MDP with periodically changing outcomes of actions, a wind flow estimation task, and a multi-armed bandit problem with periodically changing probabilities of different rewards." "In this section, we illustrate the proposed CCMLE method on several numerical examples."

Researcher Affiliation — Academia. "Melkior Ornik EMAIL Department of Aerospace Engineering and the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA. Ufuk Topcu EMAIL Dept. of Aero. Eng. and Eng. Mechanics and the Oden Inst. for Computational Eng. and Sciences, University of Texas at Austin, Austin, TX 78712, USA."

Pseudocode — No. The paper describes its methods and policies in prose, for example in Section 5.1 ("Optimal Learning Policy") and Section 5.2 ("Optimal Control Policy"), but presents no formal pseudocode blocks or algorithms with numbered steps or code-like formatting.

Open Source Code — No. The paper contains no explicit statement about releasing source code and provides no link to a code repository.

Open Datasets — No. The paper describes four numerical examples used for simulations: a patrolling task, a two-state MDP, a wind flow estimation task, and a multi-armed bandit problem. These are simulated scenarios defined within the paper by models and descriptions, rather than external, publicly available datasets with access information.

Dataset Splits — No. The paper describes simulation scenarios and numerical examples, such as a "patrolling task" or a "two-state MDP." These are custom-defined environments for which explicit training/validation/test splits are not applicable, and no such split information is provided.

Hardware Specification — No. The paper provides no details on the hardware (e.g., GPU models, CPU types, memory) used to run the experiments or simulations.

Software Dependencies — No. The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks) that would be needed to replicate the experiments.
Experiment Setup — Yes. "In order to maximize its average collected reward, the agent applies the optimal policy given in (8), with the horizon length equal to 1, and recomputes and reapplies it at every time step. In other words, at every time step the agent cares only about the results of its current and next step. An example of such behavior is exhibited in Figure 7, for β = 3/sqrt(2)." "Uncertainty weight β is set to 3/(2*sqrt(2)), a number chosen almost accidentally from previous versions of this experiment, thus without any particular tuning to this scenario."
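Among the paper's numerical examples is a two-state MDP with periodically changing outcomes of actions. A minimal simulation environment in that spirit can be sketched as follows; the period, the sinusoidal success-probability profile, and the action semantics are illustrative assumptions, not taken from the paper:

```python
import math
import random

def switch_prob(t, period=10):
    # Success probability of the "switch" action at time t; it varies
    # periodically between 0.1 and 0.9 (illustrative choice, not from the paper).
    return 0.5 + 0.4 * math.sin(2 * math.pi * t / period)

def step(state, action, t, rng):
    # Two states {0, 1}; action 1 tries to switch states and succeeds with a
    # time-varying probability, while action 0 keeps the current state.
    if action == 1 and rng.random() < switch_prob(t):
        return 1 - state
    return state

# Simulate a short trajectory starting in state 0, always trying to switch.
rng = random.Random(0)
state = 0
trajectory = []
for t in range(20):
    state = step(state, 1, t, rng)
    trajectory.append(state)
```

An estimator facing this environment sees action outcomes whose statistics drift over time, which is exactly the setting that motivates a time-varying MLE rather than a stationary one.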
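The quoted experiment setup describes a receding-horizon scheme: a horizon-1 policy is recomputed and reapplied at every time step. The paper's policy (8) is not reproduced here; the sketch below is a generic one-step greedy rule in which both the expected-reward term and the β-weighted uncertainty bonus are hypothetical stand-ins:

```python
import math

def one_step_scores(p_hat, n_obs, rewards, beta):
    # Score each action by its expected one-step reward under the current
    # transition estimate p_hat[a][s'], plus an uncertainty bonus that decays
    # with the number of times that action has been observed (hypothetical form).
    scores = []
    for a, probs in enumerate(p_hat):
        expected = sum(p * r for p, r in zip(probs, rewards))
        bonus = beta / math.sqrt(max(n_obs[a], 1))
        scores.append(expected + bonus)
    return scores

def greedy_action(p_hat, n_obs, rewards, beta=3 / (2 * math.sqrt(2))):
    # Horizon-1 receding-horizon policy: recompute the scores from the latest
    # estimates and act greedily; the default beta echoes the value quoted above.
    scores = one_step_scores(p_hat, n_obs, rewards, beta)
    return max(range(len(scores)), key=scores.__getitem__)
```

For example, with estimates p_hat = [[0.9, 0.1], [0.1, 0.9]], next-state rewards [0, 1], and equal observation counts, the rule picks action 1, whose estimated next-state distribution concentrates on the rewarding state; with very unequal counts, the bonus steers the agent toward under-explored actions.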