Learning and Planning for Time-Varying MDPs Using Maximum Likelihood Estimation

Authors: Melkior Ornik, Ufuk Topcu

JMLR 2021

Reproducibility assessment (variable — result, with the LLM's supporting response quoted or summarized for each):

Research Type — Experimental. "We demonstrate the proposed methods on four numerical examples: a patrolling task with a change in system dynamics, a two-state MDP with periodically changing outcomes of actions, a wind flow estimation task, and a multi-armed bandit problem with periodically changing probabilities of different rewards." "In this section, we illustrate the proposed CCMLE method on several numerical examples."

Researcher Affiliation — Academia. "Melkior Ornik EMAIL Department of Aerospace Engineering and the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA. Ufuk Topcu EMAIL Dept. of Aero. Eng. and Eng. Mechanics and the Oden Inst. for Computational Eng. and Sciences, University of Texas at Austin, Austin, TX 78712, USA."

Pseudocode — No. The paper describes its methods and policies in prose, for example in Section 5.1 ("Optimal Learning Policy") and Section 5.2 ("Optimal Control Policy"), but presents no formal pseudocode blocks or algorithms with numbered steps or code-like formatting.

Open Source Code — No. The paper contains no explicit statement about releasing source code and provides no link to a code repository.

Open Datasets — No. The paper describes four numerical examples used for simulations: a patrolling task, a two-state MDP, a wind flow estimation task, and a multi-armed bandit problem. These are simulated scenarios defined within the paper by models and descriptions, rather than external, publicly available datasets with access information.

Dataset Splits — No. The paper describes simulation scenarios and numerical examples, such as a "patrolling task" or a "two-state MDP." These are custom-defined environments for which explicit training/validation/test splits are not applicable, and no such split information is provided.

Hardware Specification — No. The paper provides no details on the hardware (e.g., GPU models, CPU types, memory) used to run the experiments or simulations.

Software Dependencies — No. The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks) that would be needed to replicate the experiments.
Experiment Setup — Yes. "In order to maximize its average collected reward, the agent applies the optimal policy given in (8), with the horizon length equal to 1, and recomputes and reapplies it at every time step. In other words, at every time step the agent cares only about the results of its current and next step. An example of such behavior is exhibited in Figure 7, for β = 3/sqrt(2)." "Uncertainty weight β is set to 3/(2*sqrt(2)), a number chosen almost accidentally from previous versions of this experiment, thus without any particular tuning to this scenario."
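Among the paper's numerical examples is a two-state MDP with periodically changing outcomes of actions. A minimal simulation environment in that spirit can be sketched as follows; the period, the sinusoidal success-probability profile, and the action semantics are illustrative assumptions, not taken from the paper:

```python
import math
import random

def switch_prob(t, period=10):
    # Success probability of the "switch" action at time t; it varies
    # periodically between 0.1 and 0.9 (illustrative choice, not from the paper).
    return 0.5 + 0.4 * math.sin(2 * math.pi * t / period)

def step(state, action, t, rng):
    # Two states {0, 1}; action 1 tries to switch states and succeeds with a
    # time-varying probability, while action 0 keeps the current state.
    if action == 1 and rng.random() < switch_prob(t):
        return 1 - state
    return state

# Simulate a short trajectory starting in state 0, always trying to switch.
rng = random.Random(0)
state = 0
trajectory = []
for t in range(20):
    state = step(state, 1, t, rng)
    trajectory.append(state)
```

An estimator facing this environment sees action outcomes whose statistics drift over time, which is exactly the setting that motivates a time-varying MLE rather than a stationary one.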
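The quoted experiment setup describes a receding-horizon scheme: a horizon-1 policy is recomputed and reapplied at every time step. The paper's policy (8) is not reproduced here; the sketch below is a generic one-step greedy rule in which both the expected-reward term and the β-weighted uncertainty bonus are hypothetical stand-ins:

```python
import math

def one_step_scores(p_hat, n_obs, rewards, beta):
    # Score each action by its expected one-step reward under the current
    # transition estimate p_hat[a][s'], plus an uncertainty bonus that decays
    # with the number of times that action has been observed (hypothetical form).
    scores = []
    for a, probs in enumerate(p_hat):
        expected = sum(p * r for p, r in zip(probs, rewards))
        bonus = beta / math.sqrt(max(n_obs[a], 1))
        scores.append(expected + bonus)
    return scores

def greedy_action(p_hat, n_obs, rewards, beta=3 / (2 * math.sqrt(2))):
    # Horizon-1 receding-horizon policy: recompute the scores from the latest
    # estimates and act greedily; the default beta echoes the value quoted above.
    scores = one_step_scores(p_hat, n_obs, rewards, beta)
    return max(range(len(scores)), key=scores.__getitem__)
```

For example, with estimates p_hat = [[0.9, 0.1], [0.1, 0.9]], next-state rewards [0, 1], and equal observation counts, the rule picks action 1, whose estimated next-state distribution concentrates on the rewarding state; with very unequal counts, the bonus steers the agent toward under-explored actions.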