reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Online Planning in MDPs with Stochastic Durative Actions

Authors: Tal Berman, Ronen I. Brafman, Erez Karpas

IJCAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Overall, we present the ﬁrst online planner for stochastic temporal planning, solving a richer problem representation than previous work while achieving state-of-the-art empirical results. (D) A comprehensive empirical evaluation on a problem set larger than prior work, including existing and novel domains, demonstrating improved scalability and scope. Code, domains, additional results and discussion can be found at https://github.com/tali Berman5/TP MCTS. We conducted experiments with two possible decision-time budgets: 1 and 10 seconds and domain instances of different sizes. All algorithms were implemented in Python. Experiments were run on an an AMD EPYC 7702P 64-Core Processor. Each experiment was repeated 100 times. Tables 1 and 2 show the success rate of each planner on 100 trials for 1 and 10 seconds per decision.
Researcher Affiliation	Academia	1Ben-Gurion University 2Technion EMAIL,EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1 TP-MCTS (root-interval version)
Open Source Code	Yes	Code, domains, additional results and discussion can be found at https://github.com/tali Berman5/TP MCTS.
Open Datasets	Yes	Our comparison covers more domains than previous work. We consider ﬁve structured domains that model realworld problems: NASA Rover and Machine Shop, as used by [Mausam and Weld, 2008], Stuck-Car, a new domain with more interesting stochastic effects, and two new domains with required concurrency: Hosting and Probabilistic Match Cellar. We also use three synthetic domains to study basic properties of the algorithms. We conducted experiments with two possible decision-time budgets: 1 and 10 seconds and domain instances of different sizes. All algorithms were implemented in Python. Experiments were run on an AMD EPYC 7702P 64-Core Processor. Each experiment was repeated 100 times. Our GIT contains additional results. The following general trends emerge: (1) MW-MCTS almost always dominates MW-RTDP. RTDP is better only in the smaller Simple domain, where, due to action noninteraction, it can update the value function quickly. (2) TPMCTS dominates in almost all domains. There are two exceptions: In Nasa Rover(1), MW beneﬁts from its shorter solution depth given 1sec. This advantage disappears in the larger Nasa Rover(2). MW-RTDP has a higher success rate in Prob-Conc+8 given 10sec. Given the performance on other Prob-Conc variants, this may be due to the variance in the success estimates. The impact of domain size on the algorithms is clearly visible in the three structured domains. (3) Additional search time almost always leads to increased success rates for all algorithms, but trends (1)-(2) hold in both tables. (4) Hosting-2 and Prob MC(i) are not solvable by MW s method due to the need for required concurrency. Simple-x highlight the difﬁculty MW s algorithm has scaling up as the number of non-mutex actions increases due to the exponential growth in the number of legal action combinations. Although the solution depth is smaller, the algorithm does not have sufﬁcient time to explore all actions and has a low chance of detecting the solution. With more time, it slightly scales up. TP-MCTS requires deeper solutions, but MCTS combined with PTRPG can focus exploration on more promising paths. Similar behavior is observed in Nasa Rover(2), Machine Shop(2), and Machine Shop(3). Comparing earliest and root-interval, earliest performs better given less time when temporal constraints are not complex because it can expand more nodes. Thus, trying to start an action as soon as possible is often a good heuristic. However, with sufﬁcient time, root-interval succeeds more often, and in domains with more complex temporal dependencies, such as Hosting, only it can solve the problem reliably. Additional experiments (see GIT) show that: (1) When deadlines are relaxed, success rates increase, but makespan is rarely impacted. (2) PTRPG s evaluation formula can have a signiﬁcant impact on success rates, Nevertheless, but the relative strength of all variants remains similar, with TP-MCTS remaining much stronger than the MW variants.
Dataset Splits	No	The paper mentions experiments conducted on
Hardware Specification	Yes	Experiments were run on an AMD EPYC 7702P 64-Core Processor.
Software Dependencies	No	The paper states, "All algorithms were implemented in Python." and "Our TP-MCTS implementation supports domain representations written in the Uniﬁed Planning Framework [Kapellos et al., 2023]." However, no specific version numbers are provided for Python or the Uniﬁed Planning Framework.
Experiment Setup	Yes	We conducted experiments with two possible decision-time budgets: 1 and 10 seconds and domain instances of different sizes. All algorithms were implemented in Python. Experiments were run on an AMD EPYC 7702P 64-Core Processor. Each experiment was repeated 100 times. Deadlines were set to the length of an optimal plan + time for one action failure, so as to be challenging.