Monte-Carlo Tree Search with Uncertainty Propagation via Optimal Transport

Authors: Tuan Quang Dam, Pascal Stenger, Lukas Schneider, Joni Pajarinen, Carlo D’Eramo, Odalric-Ambrym Maillard

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide theoretical guarantees of asymptotic convergence of O(n^{-1/2}), with n as the number of visited trajectories, to the optimal policy and an empirical evaluation on several stochastic and partially observable environments, where our approach outperforms well-known related baselines.
Researcher Affiliation | Academia | 1Hanoi University of Science and Technology, Hanoi, Vietnam 2Department of Computer Science, Technical University of Darmstadt, Germany 3ETH Zürich, Switzerland 4Department of Electrical Engineering and Automation, Aalto University, Finland 5Center for Artificial Intelligence and Data Science, University of Würzburg, Germany 6Hessian Center for Artificial Intelligence (Hessian.ai), Germany 7Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France.
Pseudocode | No | The paper describes the steps and concepts of the Wasserstein Monte-Carlo Tree Search (W-MCTS) algorithm, its backup operator, and action selection strategies using descriptive text and mathematical formulations. However, it does not include a distinct section or figure explicitly labeled as 'Pseudocode' or 'Algorithm'.
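Since the paper provides no pseudocode, the select/backup skeleton described in its text can be caricatured as follows. This is a deliberately simplified sketch, not the paper's Wasserstein backup operator: per-action value uncertainty is tracked as a standard deviation initialized optimistically (std = 30, echoing the quoted setup) and shrunk as 1/sqrt(n), with actions chosen by mean plus an uncertainty bonus. All names and the toy two-action problem are illustrative assumptions.

```python
import math
import random

class Node:
    """Tree node holding a Gaussian-style value estimate per action."""
    def __init__(self, n_actions, std0=30.0):
        self.visits = [0] * n_actions
        self.mean = [0.0] * n_actions
        self.std = [std0] * n_actions  # optimistic initial uncertainty

def select(node, c=1.0):
    """Optimistic action selection: mean plus an uncertainty bonus."""
    scores = [m + c * s for m, s in zip(node.mean, node.std)]
    return max(range(len(scores)), key=scores.__getitem__)

def backup(node, a, ret, std0=30.0):
    """Update action a's estimate with return ret: running-average mean,
    uncertainty contracting as 1/sqrt(visits) along visited branches."""
    node.visits[a] += 1
    n = node.visits[a]
    node.mean[a] += (ret - node.mean[a]) / n
    node.std[a] = std0 / math.sqrt(n)

def run(n_sims=2000, seed=0):
    """Run the loop on a toy two-action stochastic problem where the
    second action is better in expectation; return the most-visited action."""
    rng = random.Random(seed)
    rewards = [lambda: rng.gauss(0.0, 1.0), lambda: rng.gauss(1.0, 1.0)]
    root = Node(n_actions=2)
    for _ in range(n_sims):
        a = select(root)
        backup(root, a, rewards[a]())
    return max(range(2), key=root.visits.__getitem__)
```

With enough simulations the uncertainty bonus decays and the loop concentrates visits on the higher-mean action, which is the qualitative behavior the paper's description conveys.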
Open Source Code | No | Code for POMCP(UCT) (Silver & Veness, 2010b), D2NG (Bai et al., 2014), and DESPOT (Somani et al., 2013) is used as released by the original authors.
Open Datasets | Yes | Frozen Lake. A 4x4 grid with slippery transitions, implemented in OpenAI Gym (Brockman et al., 2016). The agent aims to reach a goal in the bottom-right corner. [...] We also test W-MCTS against POMCP(UCT), D2NG, and DESPOT in classic POMDP benchmarks: rocksample, pocman, Tag, and Laser Tag.
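FrozenLake's slippery dynamics (the agent moves in the intended direction with probability 1/3 and slips to each perpendicular direction with probability 1/3, matching Gym's default) can be sketched without the Gym dependency. The class below is an illustrative stand-in for the 4x4 benchmark, not the code used in the paper.

```python
import random

class SlipperyGrid:
    """Minimal 4x4 FrozenLake-style grid with slippery transitions."""
    # Actions in Gym's ordering: 0=Left, 1=Down, 2=Right, 3=Up.
    # Adjacent action indices are perpendicular directions.
    MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

    def __init__(self, size=4, seed=0):
        self.size = size
        self.rng = random.Random(seed)
        self.pos = (0, 0)                     # start in top-left corner
        self.goal = (size - 1, size - 1)      # goal in bottom-right corner

    def step(self, action):
        # Prob 1/3 each: intended direction or one of the two
        # perpendicular directions (never the opposite one).
        actual = self.rng.choice([(action - 1) % 4, action, (action + 1) % 4])
        dr, dc = self.MOVES[actual]
        r = min(max(self.pos[0] + dr, 0), self.size - 1)
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done
```

The stochastic transition is what makes the benchmark interesting for uncertainty-aware planners: repeating the same action from the same state yields different successor states.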
Dataset Splits | No | The environments used (Frozen Lake, NChain, River Swim, Six Arms, Taxi, rocksample, pocman, Tag, and Laser Tag) are reinforcement learning environments. The paper refers to 'averaged over 50 runs' and 'averaged over 1000 runs' to indicate experiment repetitions, not explicit training/validation/test dataset splits. There is no mention of static dataset partitioning.
Hardware Specification | Yes | All the experiments were done on an Intel(R) Core(TM) i9-14900K 3.20 GHz, 24-core CPU.
Software Dependencies | No | For DNG, D2NG, we set hyperparameters as recommended in the paper and from the author's source code (Bai et al., 2013; 2014). We set exploration constant for UCT, Power-UCT to 2. We set initial standard deviation value to std = 30. The text mentions specific algorithms used as baselines, but does not provide specific version numbers for software libraries or dependencies used in their implementation.
Experiment Setup | Yes | The hyperparameters are tuned using grid-search. Except for the case of the Pocman environment, we scale the rewards into the range [0, 1]. We use the discount factor γ = 0.95. For DNG, D2NG, we set hyperparameters as recommended in the paper and from the author's source code (Bai et al., 2013; 2014). We set exploration constant for UCT, Power-UCT to 2. We set initial standard deviation value to std = 30. In all Rocksample and Pocman environments, we set the heuristic for rollouts as treeknowledge = 0, rolloutknowledge = 1. For all environments, we increase the value of p and choose the best power mean p value for Power-UCT and W-MCTS. Details can be found in Table 6.
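The quoted tuning procedure, grid search over hyperparameters plus reward scaling into [0, 1], can be sketched generically. The `evaluate` callback, the candidate grids, and the scaling helper below are placeholders for illustration, not the paper's actual search code.

```python
import itertools

def scale_rewards(rewards, lo=0.0, hi=1.0):
    """Min-max scale raw rewards into [lo, hi], as the quoted setup
    does for all environments except Pocman."""
    r_min, r_max = min(rewards), max(rewards)
    if r_max == r_min:
        return [lo for _ in rewards]
    return [lo + (hi - lo) * (r - r_min) / (r_max - r_min) for r in rewards]

def grid_search(evaluate, grids):
    """Exhaustive grid search over hyperparameter combinations.
    `evaluate` maps a config dict to a scalar score; the best-scoring
    config and its score are returned."""
    names = list(grids)
    best_cfg, best_score = None, float("-inf")
    for values in itertools.product(*(grids[n] for n in names)):
        cfg = dict(zip(names, values))
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical grids over an exploration constant c and power-mean p;
# the real candidate values are in the paper's Table 6, not here.
example_grids = {"c": [1, 2, 4], "p": [1, 2, 4, 8]}
```

In the paper's setting, `evaluate` would be the average return of a planner configuration over repeated runs of an environment.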