Scaling Safe Policy Improvement: Monte Carlo Tree Search and Policy Iteration Strategies
Authors: Federico Bianchi, Alberto Castellini, Edoardo Zorzi, Thiago D. Simão, Matthijs T. J. Spaan, Alessandro Farinelli
JAIR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical analysis across four benchmark domains demonstrates that MCTS-SPIBB and SDP-SPIBB significantly enhance the scalability of safe policy improvement, providing robust and efficient algorithms for large-scale applications. These contributions represent a significant step towards the deployment of safe RL algorithms in complex real-world environments. (Section 4, Experimental Evaluation:) In this section, we first report an overview of the experiments performed to evaluate MCTS-SPIBB and SDP-SPIBB. Then, we provide details on the benchmark domains, the procedure used to generate baseline policies, the overall experimental procedure, and the baseline algorithms to which our methods have been compared. Finally, we analyze the results on the scalability and safety of the two proposed methods, showing that they scale to larger domains than baseline methods while keeping safety guarantees. An ablation study dissects the sensitivity of the proposed methods to key experimental factors, such as dataset diversity and the number of simulations. |
| Researcher Affiliation | Academia | FEDERICO BIANCHI, University of Verona, Italy; ALBERTO CASTELLINI, University of Verona, Italy; EDOARDO ZORZI, University of Verona, Italy; THIAGO D. SIMÃO, Eindhoven University of Technology, Netherlands; MATTHIJS T. J. SPAAN, Delft University of Technology, Netherlands; ALESSANDRO FARINELLI, University of Verona, Italy |
| Pseudocode | Yes | Algorithm 1 I-SPIBB. Input: baseline policy π₀, transition matrix P̂_D, reward matrix R, discount factor γ, stopping parameter ε. Output: an improved policy π. 1: π = π₀ // Initialization to the baseline policy. Algorithm 2 Policy Improvement. Input: last iteration of π, baseline policy π₀, state s. Output: policy improvement π_I(s). 1: Initialize π_I(s) = 0. Algorithm 3 DP-SPIBB. Input: baseline policy π₀, transition matrix P̂_D, discount factor γ, stopping parameter ε. Output: an improved policy π. 1: π = π₀ // Initialization of π to π₀. Algorithm 4 MCTS-SPIBB. Input: s: current state; π₀: baseline policy; N∧: minimum count; n_D: counter; n: total number of simulations; B(s), B̄(s): bootstrapped/non-bootstrapped actions. 1: Tr ← {} // Empty MC tree. Algorithm 5 Simulate. Input: Tr: MC tree structure; s: state node; d: current depth; π₀: baseline policy; B(s), B̄(s): bootstrapped/non-bootstrapped actions. 1: if γ^d < ε then. Algorithm 6 Select Action. Input: s: state node; B(s), B̄(s): bootstrapped/non-bootstrapped action sets; π₀: baseline policy; P_B: total probability of bootstrapped actions; roll: rollout flag. 1: p ← U([0, 1]) // Uniform sampling from [0, 1]. Algorithm 7 Rollout. Input: s: state node; d: current depth; π₀: baseline policy; B(s): bootstrapped action set; P_B: total probability of bootstrapped actions. 1: if γ^d < ε then |
| Open Source Code | Yes | All tests can be reproduced by the Python code available in the repository connected to this paper2 https://github.com/Isla-lab/scalable_SPI |
| Open Datasets | Yes | 4.2 Domains. In the following, all the domains used in the experimental evaluation are listed and described in detail. Grid World. In Grid World (Laroche et al. 2019) an agent moves in an N×N grid starting from the bottom-left corner and aiming to reach the top-right corner. Sys Admin. In Sys Admin (Guestrin et al. 2003) an agent has to administer a network of n machines. Wet Chicken. In Wet Chicken (Scholl et al. 2022a) an agent floats in a small boat on a river with a waterfall at one end. Multi-agent Sys Admin. In Multi-agent Sys Admin (Guestrin et al. 2003), each agent controls a single machine in a network of interconnected machines. |
| Dataset Splits | No | The paper describes generating datasets of a certain number of 'trajectories' for experiments, but it does not specify how these datasets are further split into training, validation, or test sets. For example, 'For each dataset size |D|, we generate ND datasets, each containing |D| trajectories.' does not describe a split methodology for machine learning tasks. |
| Hardware Specification | Yes | D Experimental Setup. The experiments are conducted on a workstation equipped with a 13th Gen Intel Core i7-13700K CPU with 24 threads and 64.0 GB of RAM, running Ubuntu 22.04.5 LTS 64-bit as the operating system. |
| Software Dependencies | No | All code was implemented in Python, leveraging standard scientific computing libraries. The paper does not specify the version of Python used or the specific versions of the 'standard scientific computing libraries'. |
| Experiment Setup | Yes | Table 2 (hyperparameters of the scalability experiments, Section 4.6, Figure 2): Domain: single-agent Sys Admin with 4, 7, 10, 12, 13, 20, 35 machines; Number of datasets: 20; |D|: 5000; N: 5; Measures: ρ(π_I, s), time [s]; Number of runs: 20 runs of 15 steps each; Number of MCTS simulations: 4000; Baseline policy: sub-optimal (0.7). Table 3 (hyperparameters of the safety experiments, Section 4.7, Figure 5): Domain: Wet Chicken; Number of datasets: 20; |D|: 100, 200, 500, 1000, 2000, 5000; N: 5; Measures: ρ(π_I, s), 15% CVaR of ρ(π_I, s); Number of runs: policy evaluation; Number of MCTS simulations: 10000; Baseline policy: sub-optimal (0.7). PPO: learning rate [0.00001, 0.00003, 0.0001, 0.0003], entropy coefficient [0.0, 0.005, 0.01], clip range [0.1, 0.2, 0.3], batch size [64, 128, 256], number of epochs [4, 10]. DQN: learning rate [0.00001, 0.00003, 0.0001, 0.0003], exploration fraction [0.1, 0.3], batch size [64, 128, 256], buffer size 500000, target update interval [100, 500]. CQL: learning rate [0.00001, 0.00003, 0.0001, 0.0003], alpha [1, 5, 10], batch size [64, 128, 256], Q updates per step [200, 500], target update interval [100, 500]. Decision Transformer: context length [1, 10, 20, 80, 100], hidden size [56, 64, 128, 256, 512], number of transformer layers [1, 2, 4, 8], number of attention heads [2, 4, 8, 10], dropout rate [0.0, 0.001, 0.01, 0.1], learning rate [0.00001, 0.0001, 0.001, 0.01], batch size [64, 128, 256], number of epochs [1, 3, 10, 20, 30, 50, 80, 100]. MOReL: ensemble hidden sizes [64, 128, 256, 512], ensemble learning rates [0.0001, 0.001, 0.01], ensemble batch sizes [64, 128, 256], ensemble epochs [5, 10, 20, 50], ensemble sizes [2, 4, 8], PPO learning rate [0.00001, 0.00003, 0.0001, 0.0003], PPO entropy coefficient [0.0, 0.005, 0.01], PPO clip range [0.1, 0.2, 0.3], PPO batch size [64, 128, 256], PPO number of epochs [4, 10]. |
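The pseudocode row references a minimum count and a state-action counter; in the SPIBB family these define the "bootstrapped" set of actions observed too rarely in the dataset to be trusted. A minimal sketch of that set construction, with hypothetical names (`bootstrapped_actions`, `n_min`) not taken from the paper's repository:

```python
from collections import defaultdict

def bootstrapped_actions(counts, state, actions, n_min):
    """Split actions into bootstrapped (seen fewer than n_min times in the
    dataset) and non-bootstrapped sets, as used by SPIBB-style algorithms."""
    boot = {a for a in actions if counts[(state, a)] < n_min}
    return boot, set(actions) - boot

# Toy state-action counts, as if extracted from a dataset D
counts = defaultdict(int, {("s0", "left"): 12, ("s0", "right"): 2})
B, B_bar = bootstrapped_actions(counts, "s0", ["left", "right"], n_min=5)
```

Under-observed actions ("right" above) land in the bootstrapped set, where the improved policy is constrained to follow the baseline.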
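Algorithm 6 (Select Action) as quoted draws a uniform sample and branches on the total baseline probability of bootstrapped actions. A hedged sketch of that selection rule, assuming UCB1 is used for the non-bootstrapped branch (the function and argument names are illustrative, not the repository's):

```python
import math
import random

def select_action(pi0, boot, non_boot, q, n, n_total, c=1.0, rng=random):
    """Sketch of MCTS-SPIBB-style action selection: with probability P_B
    (the baseline's total mass on bootstrapped actions) follow the baseline
    over the bootstrapped set; otherwise pick a non-bootstrapped action by UCB1."""
    p_b = sum(pi0[a] for a in boot)
    if rng.random() < p_b:
        # Follow the baseline policy, renormalised over the bootstrapped set
        weights = [pi0[a] / p_b for a in boot]
        return rng.choices(list(boot), weights=weights)[0]

    def ucb(a):
        # Unvisited actions are tried first; otherwise mean value plus bonus
        if n[a] == 0:
            return math.inf
        return q[a] + c * math.sqrt(math.log(n_total) / n[a])

    return max(non_boot, key=ucb)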
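Algorithms 5 and 7 both terminate when γ^d < ε, i.e. when the discount raised to the current depth falls below the stopping parameter, so further rewards contribute negligibly. This implies a fixed maximum simulation depth, sketched below (the helper name is ours):

```python
import math

def max_depth(gamma, eps):
    """Smallest depth d with gamma**d < eps: beyond this depth the
    discounted contribution of any reward is below the stopping threshold."""
    return math.ceil(math.log(eps) / math.log(gamma))
```

For example, with γ = 0.95 and ε = 0.01 simulations stop at depth 90.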
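The Grid World domain described in the datasets row (an agent on an N×N grid moving from the bottom-left to the top-right corner) can be sketched as a one-step transition function. This is a deterministic toy version; the stochastic dynamics used in the paper are omitted, and the function name is ours:

```python
def grid_world_step(pos, action, n):
    """Minimal Grid World dynamics: the agent moves on an n x n grid from
    the bottom-left corner (0, 0) toward the goal at (n-1, n-1). Moves off
    the grid are clipped; reaching the goal yields reward 1."""
    x, y = pos
    moves = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    dx, dy = moves[action]
    x2 = min(max(x + dx, 0), n - 1)
    y2 = min(max(y + dy, 0), n - 1)
    reward = 1.0 if (x2, y2) == (n - 1, n - 1) else 0.0
    return (x2, y2), reward
```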
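The dataset-splits row quotes the protocol "for each dataset size |D|, we generate ND datasets, each containing |D| trajectories". A hedged sketch of that generation loop, with the baseline policy rolling out trajectories in a toy environment (all names and the chain-MDP toy are illustrative, not from the repository):

```python
import random

def generate_datasets(env_step, env_reset, pi0, n_datasets, n_traj, horizon, rng):
    """Build n_datasets datasets, each with n_traj trajectories of
    (s, a, r, s') tuples collected by the baseline policy pi0."""
    datasets = []
    for _ in range(n_datasets):
        dataset = []
        for _ in range(n_traj):
            s = env_reset()
            traj = []
            for _ in range(horizon):
                a = pi0(s, rng)
                s2, r = env_step(s, a, rng)
                traj.append((s, a, r, s2))
                s = s2
            dataset.append(traj)
        datasets.append(dataset)
    return datasets

# Toy illustration: a chain MDP where action a moves the state by a
toy = generate_datasets(
    env_step=lambda s, a, rng: (s + a, 1.0),
    env_reset=lambda: 0,
    pi0=lambda s, rng: 1,
    n_datasets=3, n_traj=2, horizon=4,
    rng=random.Random(0),
)
```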
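The hyperparameter lists for the baseline learners (PPO, DQN, CQL, Decision Transformer, MOReL) read as grid-search ranges. A sketch of enumerating one such grid, using the PPO values quoted above (the dictionary keys and helper name are our reconstruction, not the repository's configuration format):

```python
from itertools import product

# PPO search grid as listed in Table 3
ppo_grid = {
    "learning_rate": [0.00001, 0.00003, 0.0001, 0.0003],
    "entropy_coef": [0.0, 0.005, 0.01],
    "clip_range": [0.1, 0.2, 0.3],
    "batch_size": [64, 128, 256],
    "n_epochs": [4, 10],
}

def grid_configs(grid):
    """Enumerate every hyperparameter combination in the grid."""
    keys = list(grid)
    return [dict(zip(keys, vals)) for vals in product(*(grid[k] for k in keys))]

configs = grid_configs(ppo_grid)  # 4 * 3 * 3 * 3 * 2 = 216 configurations
```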