Scaling Safe Policy Improvement: Monte Carlo Tree Search and Policy Iteration Strategies

Authors: Federico Bianchi, Alberto Castellini, Edoardo Zorzi, Thiago D. Simão, Matthijs T. J. Spaan, Alessandro Farinelli

JAIR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our empirical analysis across four benchmark domains demonstrates that MCTS-SPIBB and SDP-SPIBB significantly enhance the scalability of safe policy improvement, providing robust and efficient algorithms for large-scale applications. These contributions represent a significant step towards the deployment of safe RL algorithms in complex real-world environments. 4 Experimental Evaluation In this section, we first report an overview of the experiments performed to evaluate MCTS-SPIBB and SDP-SPIBB. Then, we provide details on the benchmark domains, the procedure used to generate baseline policies, the overall experimental procedure and the baseline algorithms to which our methods have been compared. Finally, we analyze the results on the scalability and safety of the two proposed methods, showing that they scale to larger domains than baseline methods while keeping safety guarantees. An ablation study dissects the sensitivity of the proposed methods to key experimental factors, such as dataset diversity and the number of simulations.
Researcher Affiliation Academia FEDERICO BIANCHI, University of Verona, Italy ALBERTO CASTELLINI, University of Verona, Italy EDOARDO ZORZI, University of Verona, Italy THIAGO D. SIMÃO, Eindhoven University of Technology, Netherlands MATTHIJS T. J. SPAAN, Delft University of Technology, Netherlands ALESSANDRO FARINELLI, University of Verona, Italy
Pseudocode Yes
Algorithm 1 I-SPIBB. Input: baseline policy π0, transition matrix T_D, reward matrix R, discount factor γ, stopping parameter ε. Output: an improved policy π. 1: π = π0 // Initialization to the baseline policy
Algorithm 2 Policy Improvement. Input: last iteration of Q, baseline policy π0, state s. Output: policy improvement π_I(s). 1: Initialize π_I(s) = 0
Algorithm 3 DP-SPIBB. Input: baseline policy π0, transition matrix T_D, discount factor γ, stopping parameter ε. Output: an improved policy π. 1: π = π0 // Initialization of π to π0
Algorithm 4 MCTS-SPIBB. Input: s: current state; π0: baseline policy; N_∧: minimum count; N_D: counter; m: total number of simulations; B_A(s), B̄_A(s): bootstrapped/non-bootstrapped actions. 1: Tr ← {} // Empty MC tree
Algorithm 5 Simulate. Input: Tr: MC tree structure; s: state node; d: current depth; π0: baseline policy; B_A(s), B̄_A(s): bootstrapped/non-bootstrapped actions. 1: if γ^d < ε then
Algorithm 6 Select Action. Input: s: state node; B_A(s), B̄_A(s): bootstrapped/non-bootstrapped action sets; π0: baseline policy; p_B^s: total probability of bootstrapped actions; roll: rollout flag. 1: θ ← U([0, 1]) // Uniform sampling from [0, 1]
Algorithm 7 Rollout. Input: s: state node; d: current depth; π0: baseline policy; B_A(s): bootstrapped action set; p_B^s: total probability of bootstrapped actions. 1: if γ^d < ε then
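The core step of Algorithm 6 (sample θ uniformly, follow the baseline policy within the bootstrapped action set with probability p_B^s, otherwise act greedily among non-bootstrapped actions) can be sketched as follows. This is a minimal illustration under assumed data structures (a tabular `pi0` mapping states to action-probability dicts, a `bootstrapped` predicate, and a `q_values` table), not the authors' implementation:

```python
import random

def select_action(s, pi0, bootstrapped, q_values):
    """Sketch of SPIBB-style action selection (cf. Algorithm 6).

    With probability p_B (the total baseline-policy probability mass of
    the bootstrapped actions) the baseline policy is followed, restricted
    to the bootstrapped set; otherwise the estimated-best
    non-bootstrapped action is chosen.
    """
    actions = list(pi0[s].keys())
    b_actions = [a for a in actions if bootstrapped(s, a)]
    nb_actions = [a for a in actions if not bootstrapped(s, a)]
    p_b = sum(pi0[s][a] for a in b_actions)  # total bootstrapped mass

    theta = random.random()  # theta ~ U([0, 1]), as in Algorithm 6
    if theta < p_b and b_actions:
        # Follow the baseline policy within the bootstrapped set.
        weights = [pi0[s][a] for a in b_actions]
        return random.choices(b_actions, weights=weights)[0]
    # Otherwise pick the non-bootstrapped action with the highest Q estimate.
    return max(nb_actions, key=lambda a: q_values[(s, a)])
```

The split between "follow the baseline" and "improve greedily" is what preserves the safety guarantee: actions whose counts fall below the minimum count N_∧ are never chosen more often than the baseline would choose them.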
Open Source Code Yes All tests can be reproduced with the Python code available in the repository connected to this paper: https://github.com/Isla-lab/scalable_SPI
Open Datasets Yes 4.2 Domains In the following, all the domains used in the experimental evaluation are listed and described in detail. Grid World. In Grid World (Laroche et al. 2019) an agent moves in an N × N grid starting from the bottom-left corner and aiming to reach the top-right corner. Sys Admin. In Sys Admin (Guestrin et al. 2003) an agent has to administer a network of N machines. Wet Chicken. In Wet Chicken (Scholl et al. 2022a) an agent floats in a small boat on a river with a waterfall at one end. Multi-agent Sys Admin. In Multi-agent Sys Admin (Guestrin et al. 2003), each agent controls a single machine in a network of interconnected machines.
Dataset Splits No The paper describes generating datasets of a certain number of 'trajectories' for experiments, but it does not specify how these datasets are further split into training, validation, or test sets. For example, 'For each dataset size |D|, we generate ND datasets, each containing |D| trajectories.' does not describe a split methodology for machine learning tasks.
Hardware Specification Yes D Experimental Setup The experiments are conducted on a workstation equipped with a 13th Gen Intel Core i7-13700K CPU with 24 threads and 64.0 GB of RAM, running Ubuntu 22.04.5 LTS 64-bit as operating system.
Software Dependencies No All code was implemented in Python, leveraging standard scientific computing libraries. The paper does not specify the version of Python used or the specific versions of the 'standard scientific computing libraries'.
Experiment Setup Yes
Table 2. Hyperparameters of the experiments on scalability described in Section 4.6.
Experiment: Scalability (see Figure 2)
Domain: single-agent Sys Admin with 4, 7, 10, 12, 13, 20, 35 machines
Number of datasets: 20
|D|: 5000
N_∧: 5
Measures: ρ(π_I, M), time [s]
Number of runs: 20 runs of 15 steps each
Number of MCTS simulations: 4000
Baseline policy: sub-optimal (p = 0.7)
Table 3. Hyperparameters of the experiments on safety described in Section 4.7.
Experiment: Safety (see Figure 5)
Domain: Wet Chicken
Number of datasets: 20
|D|: 100, 200, 500, 1000, 2000, 5000
N_∧: 5
Measures: ρ(π_I, M), 15% CVaR of ρ(π_I, M)
Number of runs: policy evaluation
Number of MCTS simulations: 10000
Baseline policy: sub-optimal (p = 0.7)
PPO: learning rate [0.00001, 0.00003, 0.0001, 0.0003]; entropy coefficient [0.0, 0.005, 0.01]; clip range [0.1, 0.2, 0.3]; batch size [64, 128, 256]; number of epochs [4, 10]
DQN: learning rate [0.00001, 0.00003, 0.0001, 0.0003]; exploration fraction [0.1, 0.3]; batch size [64, 128, 256]; buffer size 500000; target update interval [100, 500]
CQL: learning rate [0.00001, 0.00003, 0.0001, 0.0003]; alpha [1, 5, 10]; batch size [64, 128, 256]; Q updates per step [200, 500]; target update interval [100, 500]
Decision Transformer: context length [1, 10, 20, 80, 100]; hidden size [56, 64, 128, 256, 512]; number of transformer layers [1, 2, 4, 8]; number of attention heads [2, 4, 8, 10]; dropout rate [0.0, 0.001, 0.01, 0.1]; learning rate [0.00001, 0.0001, 0.001, 0.01]; batch size [64, 128, 256]; number of epochs [1, 3, 10, 20, 30, 50, 80, 100]
MOReL: ensemble hidden sizes [64, 128, 256, 512]; ensemble learning rates [0.0001, 0.001, 0.01]; ensemble batch sizes [64, 128, 256]; ensemble epochs [5, 10, 20, 50]; ensemble sizes [2, 4, 8]; PPO learning rate [0.00001, 0.00003, 0.0001, 0.0003]; PPO entropy coefficient [0.0, 0.005, 0.01]; PPO clip range [0.1, 0.2, 0.3]; PPO batch size [64, 128, 256]; PPO number of epochs [4, 10]
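For illustration, the baseline hyperparameter grids reported above can be collected into a plain dictionary and enumerated with a Cartesian product. This is a sketch of how the sweeps could be reproduced; the key names and the `grid_configs` helper are mine, not taken from the paper or its code (only the PPO and DQN grids are shown):

```python
from itertools import product

# Hyperparameter grids for two of the baseline algorithms in Table 3.
# Key names are illustrative, not from the authors' repository.
BASELINE_GRIDS = {
    "PPO": {
        "learning_rate": [0.00001, 0.00003, 0.0001, 0.0003],
        "entropy_coefficient": [0.0, 0.005, 0.01],
        "clip_range": [0.1, 0.2, 0.3],
        "batch_size": [64, 128, 256],
        "n_epochs": [4, 10],
    },
    "DQN": {
        "learning_rate": [0.00001, 0.00003, 0.0001, 0.0003],
        "exploration_fraction": [0.1, 0.3],
        "batch_size": [64, 128, 256],
        "buffer_size": [500000],
        "target_update_interval": [100, 500],
    },
}

def grid_configs(grid):
    """Yield every configuration (dict) in a hyperparameter grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))
```

Enumerating the PPO grid this way yields 4 × 3 × 3 × 3 × 2 = 216 configurations, which gives a sense of the tuning budget behind the baseline comparisons.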