Monte-Carlo Tree Search with Uncertainty Propagation via Optimal Transport
Authors: Tuan Quang Dam, Pascal Stenger, Lukas Schneider, Joni Pajarinen, Carlo D’Eramo, Odalric-Ambrym Maillard
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide theoretical guarantees of asymptotic convergence of O(n^{-1/2}), with n as the number of visited trajectories, to the optimal policy and an empirical evaluation on several stochastic and partially observable environments, where our approach outperforms well-known related baselines. |
| Researcher Affiliation | Academia | 1Hanoi University of Science and Technology, Hanoi, Vietnam 2Department of Computer Science, Technical University of Darmstadt, Germany 3ETH Zurich, Switzerland 4Department of Electrical Engineering and Automation, Aalto University, Finland 5Center for Artificial Intelligence and Data Science, University of Würzburg, Germany 6Hessian Center for Artificial Intelligence (Hessian.ai), Germany 7Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France. |
| Pseudocode | No | The paper describes the steps and concepts of the Wasserstein Monte-Carlo Tree Search (W-MCTS) algorithm, its backup operator, and action selection strategies using descriptive text and mathematical formulations. However, it does not include a distinct section or figure explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | Code for POMCP(UCT) (Silver & Veness, 2010b), D2NG (Bai et al., 2014), and DESPOT (Somani et al., 2013) is used as released by the original authors. |
| Open Datasets | Yes | Frozen Lake. A 4×4 grid with slippery transitions, implemented in OpenAI Gym (Brockman et al., 2016). The agent aims to reach a goal in the bottom-right corner. [...] We also test W-MCTS against POMCP(UCT), D2NG, and DESPOT in classic POMDP benchmarks: rocksample, pocman, Tag, and Laser Tag. |
| Dataset Splits | No | The environments used (Frozen Lake, NChain, River Swim, Six Arms, Taxi, rocksample, pocman, Tag, and Laser Tag) are reinforcement learning environments. The paper refers to 'averaged over 50 runs' and 'averaged over 1000 runs' to indicate experiment repetitions, not explicit training/validation/test dataset splits. There is no mention of static dataset partitioning. |
| Hardware Specification | Yes | All the experiments were done on an Intel(R) Core(TM) i9-14900K 3.20 GHz 24 cores/CPU. |
| Software Dependencies | No | For DNG, D2NG, we set hyperparameters as recommended in the paper and from the authors' source code (Bai et al., 2013; 2014). We set the exploration constant for UCT, Power-UCT to 2. We set the initial standard deviation value to std = 30. The text mentions specific algorithms used as baselines, but does not provide specific version numbers for software libraries or dependencies used in their implementation. |
| Experiment Setup | Yes | The hyperparameters are tuned using grid-search. Except for the case of the Pocman environment, we scale the rewards into the range [0, 1]. We use the discount factor γ = 0.95. For DNG, D2NG, we set hyperparameters as recommended in the paper and from the authors' source code (Bai et al., 2013; 2014). We set the exploration constant for UCT, Power-UCT to 2. We set the initial standard deviation value to std = 30. In all Rocksample and Pocman environments, we set the heuristic for rollouts as treeknowledge = 0, rolloutknowledge = 1. For all environments, we increase the value of p and choose the best power mean p value for Power-UCT and W-MCTS. Details can be found in Table 6. |
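The experiment-setup row above can be summarized as a small configuration sketch. This is a hypothetical illustration, not the authors' code: the variable names and the candidate grid for the power-mean parameter p are assumptions (the paper reports the actual per-environment values in its Table 6), while the numeric hyperparameters are those quoted in the table.

```python
def scale_reward(r, r_min, r_max):
    """Scale a raw reward into [0, 1], as the paper reports doing
    for all environments except Pocman."""
    return (r - r_min) / (r_max - r_min)

# Hyperparameters quoted in the "Experiment Setup" row.
config = {
    "gamma": 0.95,               # discount factor
    "exploration_constant": 2,   # for UCT / Power-UCT
    "init_std": 30,              # initial standard deviation
    # Power-mean p is tuned by grid search per environment; this
    # candidate grid is an assumed placeholder, not from the paper.
    "p_grid": [1, 2, 4, 8],
}

print(scale_reward(5.0, 0.0, 10.0))  # 0.5
```

The point of the sketch is that the setup is fully specified by a handful of scalars plus a per-environment grid search, which is consistent with the row's "No" verdict on software dependencies: no particular library version is implied.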