Scaling Safe Policy Improvement: Monte Carlo Tree Search and Policy Iteration Strategies

Authors: Federico Bianchi, Alberto Castellini, Edoardo Zorzi, Thiago D. Simão, Matthijs T. J. Spaan, Alessandro Farinelli

JAIR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our empirical analysis across four benchmark domains demonstrates that MCTS-SPIBB and SDP-SPIBB significantly enhance the scalability of safe policy improvement, providing robust and efficient algorithms for large-scale applications. These contributions represent a significant step towards the deployment of safe RL algorithms in complex real-world environments. 4 Experimental Evaluation In this section, we first report an overview of the experiments performed to evaluate MCTS-SPIBB and SDP-SPIBB. Then, we provide details on the benchmark domains, the procedure used to generate baseline policies, the overall experimental procedure and the baseline algorithms to which our methods have been compared. Finally, we analyze the results on the scalability and safety of the two proposed methods, showing that they scale to larger domains than baseline methods while keeping safety guarantees. An ablation study dissects the sensitivity of the proposed methods to key experimental factors, such as dataset diversity and the number of simulations.
Researcher Affiliation Academia FEDERICO BIANCHI, University of Verona, Italy ALBERTO CASTELLINI, University of Verona, Italy EDOARDO ZORZI, University of Verona, Italy THIAGO D. SIMÃO, Eindhoven University of Technology, Netherlands MATTHIJS T. J. SPAAN, Delft University of Technology, Netherlands ALESSANDRO FARINELLI, University of Verona, Italy
Pseudocode Yes
Algorithm 1 I-SPIBB. Input: baseline policy π0, transition matrix T_D, reward matrix R, discount factor γ, stopping parameter ε. Output: an improved policy π. 1: π = π0 // Initialization to the baseline policy
Algorithm 2 Policy Improvement. Input: last iteration of Q, baseline policy π0, state s. Output: policy improvement π_I(s). 1: Initialize π_I(s) = 0
Algorithm 3 DP-SPIBB. Input: baseline policy π0, transition matrix T_D, discount factor γ, stopping parameter ε. Output: an improved policy π. 1: π = π0 // Initialization of π to π0
Algorithm 4 MCTS-SPIBB. Input: s: current state; π0: baseline policy; N_∧: minimum count; N_D: counter; m: total number of simulations; B_A(s), B̄_A(s): bootstrapped/non-bootstrapped actions. 1: Tr ← {} // Empty MC tree
Algorithm 5 Simulate. Input: Tr: MC tree structure; s: state node; d: current depth; π0: baseline policy; B_A(s), B̄_A(s): bootstrapped/non-bootstrapped actions. 1: if γ^d < ε then
Algorithm 6 Select Action. Input: s: state node; B_A(s), B̄_A(s): bootstrapped/non-bootstrapped action sets; π0: baseline policy; p_B^s: total probability of bootstrapped actions; roll: rollout flag. 1: θ ← U([0, 1]) // Uniform sampling from [0, 1]
Algorithm 7 Rollout. Input: s: state node; d: current depth; π0: baseline policy; B_A(s): bootstrapped action set; p_B^s: total probability of bootstrapped actions. 1: if γ^d < ε then
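The core step of Algorithm 6 (sample θ uniformly, follow the baseline policy within the bootstrapped action set with probability p_B^s, otherwise act greedily among non-bootstrapped actions) can be sketched as follows. This is a minimal illustration under assumed data structures (a tabular `pi0` mapping states to action-probability dicts, a `bootstrapped` predicate, and a `q_values` table), not the authors' implementation:

```python
import random

def select_action(s, pi0, bootstrapped, q_values):
    """Sketch of SPIBB-style action selection (cf. Algorithm 6).

    With probability p_B (the total baseline-policy probability mass of
    the bootstrapped actions) the baseline policy is followed, restricted
    to the bootstrapped set; otherwise the estimated-best
    non-bootstrapped action is chosen.
    """
    actions = list(pi0[s].keys())
    b_actions = [a for a in actions if bootstrapped(s, a)]
    nb_actions = [a for a in actions if not bootstrapped(s, a)]
    p_b = sum(pi0[s][a] for a in b_actions)  # total bootstrapped mass

    theta = random.random()  # theta ~ U([0, 1]), as in Algorithm 6
    if theta < p_b and b_actions:
        # Follow the baseline policy within the bootstrapped set.
        weights = [pi0[s][a] for a in b_actions]
        return random.choices(b_actions, weights=weights)[0]
    # Otherwise pick the non-bootstrapped action with the highest Q estimate.
    return max(nb_actions, key=lambda a: q_values[(s, a)])
```

The split between "follow the baseline" and "improve greedily" is what preserves the safety guarantee: actions whose counts fall below the minimum count N_∧ are never chosen more often than the baseline would choose them.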
Open Source Code Yes All tests can be reproduced with the Python code available in the repository connected to this paper: https://github.com/Isla-lab/scalable_SPI
Open Datasets Yes 4.2 Domains In the following, all the domains used in the experimental evaluation are listed and described in detail. Grid World. In Grid World (Laroche et al. 2019) an agent moves in an N × N grid starting from the bottom-left corner and aiming to reach the top-right corner. Sys Admin. In Sys Admin (Guestrin et al. 2003) an agent has to administer a network of N machines. Wet Chicken. In Wet Chicken (Scholl et al. 2022a) an agent floats in a small boat on a river with a waterfall at one end. Multi-agent Sys Admin. In Multi-agent Sys Admin (Guestrin et al. 2003), each agent controls a single machine in a network of interconnected machines.
Dataset Splits No The paper describes generating datasets of a certain number of 'trajectories' for experiments, but it does not specify how these datasets are further split into training, validation, or test sets. For example, 'For each dataset size |D|, we generate ND datasets, each containing |D| trajectories.' does not describe a split methodology for machine learning tasks.
Hardware Specification Yes D Experimental Setup The experiments are conducted on a workstation equipped with a 13th Gen Intel Core i7-13700K CPU with 24 threads and 64.0 GB of RAM, running Ubuntu 22.04.5 LTS 64-bit as operating system.
Software Dependencies No All code was implemented in Python, leveraging standard scientific computing libraries. The paper does not specify the version of Python used or the specific versions of the 'standard scientific computing libraries'.
Experiment Setup Yes
Table 2. Hyperparameters of the experiments on scalability described in Section 4.6.
Experiment: Scalability (see Figure 2)
Domain: single-agent Sys Admin with 4, 7, 10, 12, 13, 20, 35 machines
Number of datasets: 20
|D|: 5000
N_∧: 5
Measures: ρ(π_I, M), time [s]
Number of runs: 20 runs of 15 steps each
Number of MCTS simulations: 4000
Baseline policy: sub-optimal (p = 0.7)
Table 3. Hyperparameters of the experiments on safety described in Section 4.7.
Experiment: Safety (see Figure 5)
Domain: Wet Chicken
Number of datasets: 20
|D|: 100, 200, 500, 1000, 2000, 5000
N_∧: 5
Measures: ρ(π_I, M), 15% CVaR of ρ(π_I, M)
Number of runs: policy evaluation
Number of MCTS simulations: 10000
Baseline policy: sub-optimal (p = 0.7)
PPO: learning rate [0.00001, 0.00003, 0.0001, 0.0003]; entropy coefficient [0.0, 0.005, 0.01]; clip range [0.1, 0.2, 0.3]; batch size [64, 128, 256]; number of epochs [4, 10]
DQN: learning rate [0.00001, 0.00003, 0.0001, 0.0003]; exploration fraction [0.1, 0.3]; batch size [64, 128, 256]; buffer size 500000; target update interval [100, 500]
CQL: learning rate [0.00001, 0.00003, 0.0001, 0.0003]; alpha [1, 5, 10]; batch size [64, 128, 256]; Q updates per step [200, 500]; target update interval [100, 500]
Decision Transformer: context length [1, 10, 20, 80, 100]; hidden size [56, 64, 128, 256, 512]; number of transformer layers [1, 2, 4, 8]; number of attention heads [2, 4, 8, 10]; dropout rate [0.0, 0.001, 0.01, 0.1]; learning rate [0.00001, 0.0001, 0.001, 0.01]; batch size [64, 128, 256]; number of epochs [1, 3, 10, 20, 30, 50, 80, 100]
MOReL: ensemble hidden sizes [64, 128, 256, 512]; ensemble learning rates [0.0001, 0.001, 0.01]; ensemble batch sizes [64, 128, 256]; ensemble epochs [5, 10, 20, 50]; ensemble sizes [2, 4, 8]; PPO learning rate [0.00001, 0.00003, 0.0001, 0.0003]; PPO entropy coefficient [0.0, 0.005, 0.01]; PPO clip range [0.1, 0.2, 0.3]; PPO batch size [64, 128, 256]; PPO number of epochs [4, 10]
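For illustration, the baseline hyperparameter grids reported above can be collected into a plain dictionary and enumerated with a Cartesian product. This is a sketch of how the sweeps could be reproduced; the key names and the `grid_configs` helper are mine, not taken from the paper or its code (only the PPO and DQN grids are shown):

```python
from itertools import product

# Hyperparameter grids for two of the baseline algorithms in Table 3.
# Key names are illustrative, not from the authors' repository.
BASELINE_GRIDS = {
    "PPO": {
        "learning_rate": [0.00001, 0.00003, 0.0001, 0.0003],
        "entropy_coefficient": [0.0, 0.005, 0.01],
        "clip_range": [0.1, 0.2, 0.3],
        "batch_size": [64, 128, 256],
        "n_epochs": [4, 10],
    },
    "DQN": {
        "learning_rate": [0.00001, 0.00003, 0.0001, 0.0003],
        "exploration_fraction": [0.1, 0.3],
        "batch_size": [64, 128, 256],
        "buffer_size": [500000],
        "target_update_interval": [100, 500],
    },
}

def grid_configs(grid):
    """Yield every configuration (dict) in a hyperparameter grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))
```

Enumerating the PPO grid this way yields 4 × 3 × 3 × 3 × 2 = 216 configurations, which gives a sense of the tuning budget behind the baseline comparisons.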