Trust-Region Twisted Policy Improvement
Authors: Joery A. De Vries, Jinke He, Yaniv Oren, Matthijs T. J. Spaan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compared our TRT-SMC against the variational SMC method by Macfarlane et al. (2024), and the current strongest Monte-Carlo tree search (MCTS) method, Gumbel AlphaZero by Danihelka et al. (2022). We performed experiments in the Brax continuous control tasks (Freeman et al., 2021) and Jumanji discrete environments (Bonnet et al., 2024), using the authors' A2C and SAC results as baselines alongside our PPO implementation (Schulman et al., 2017). Although we compare sample-efficiency to the model-free baselines, this is only for reference since we do not account for the additional transitions observed by the planner in the main results. We show the evaluation curves comparing sample-efficiency in Figure 2, where the planner-based methods used a budget of N = 16 transitions. For simplicity, we kept the depth m of the SMC planner equal to the number of particles K, such that K = m = N. Our ablations in Subsection 4.2 also show that keeping m and K roughly in tandem is ideal for SMC. |
| Researcher Affiliation | Academia | Joery A. de Vries, Jinke He, Yaniv Oren, Matthijs T. J. Spaan (Delft University of Technology, Delft, the Netherlands). Correspondence to: Joery A. de Vries <J.A.de EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Bootstrapped Particle Filter for RL; Algorithm 2: Bootstrapped Particle Filter for RL (our TRT-SMC pseudocode, based on Algorithm 1); Algorithm 3: Outer EM-loop for Approximate Policy Iteration. |
| Open Source Code | Yes | Our code can be found at https://github.com/joeryjoery/trtpi. |
| Open Datasets | Yes | We performed experiments in the Brax continuous control tasks (Freeman et al., 2021) and Jumanji discrete environments (Bonnet et al., 2024), using the authors' A2C and SAC results as baselines alongside our PPO implementation (Schulman et al., 2017). We used the Jumanji 1.0.1 implementations of the Snake-v1 and RubiksCube-partly-scrambled-v0 environments (Bonnet et al., 2024); code is available at https://github.com/instadeepai/jumanji. For the Brax 0.10.5 implementation we used the Ant and Halfcheetah environments with the spring backend; code is available at https://github.com/google/brax. |
| Dataset Splits | No | The paper conducts experiments on reinforcement learning environments (Brax and Jumanji). In this context, agents interact with a simulator, generating data on-the-fly for training and evaluation. The concept of static "training/test/validation dataset splits" typical of supervised learning on fixed datasets is not directly applicable. The paper reports "average offline return over 128 episodes" for evaluation and refers to "Training Samples" in figures, which indicates performance over generated episodes rather than predefined static data partitions. |
| Hardware Specification | Yes | All experiments were run on a GPU cluster with a mix of NVIDIA GeForce RTX 2080 Ti 11GB, Tesla V100-SXM2 32GB, NVIDIA A40 48GB, and A100 80GB GPU cards (Delft AI Cluster (DAIC), 2024; Delft High Performance Computing Centre (DHPC)). Each run (random seed/repetition) required only a few CPU cores (2 logical cores) with a low memory budget (e.g., 4GB). For our most expensive individual experiments we found that we needed about 6GB of VRAM at most, and that the replay buffer size is the most important parameter in this regard. |
| Software Dependencies | Yes | Table 2. Software module versioning used for our experiments (also includes default parameter settings): brax 0.10.5, optax 0.2.3, flashbax 0.1.2, rlax 0.1.6, mctx 0.0.5, flax 0.8.4, jumanji 1.0.1. We implemented everything based on Jax 0.4.30 in Python 3.12.4. |
| Experiment Setup | Yes | The hyperparameters for our experiments are summarized in the following tables: Table 3: Shared parameters across experiments; Table 4: PPO-specific parameters; Table 5: Gumbel Monte-Carlo tree search experiment hyperparameters; Table 6: Shared Sequential Monte-Carlo hyperparameters; Table 7: Trust-Region Twisted Sequential Monte-Carlo hyperparameters. All default values are indicated in bold; all other parameter values listed in the sets were run in an exhaustive grid for the ablations. |
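The Research Type row notes that the SMC planner ties its particle count K to its depth m under a transition budget N, with K = m = N = 16. The toy sketch below illustrates the generic bootstrapped particle filter idea that Algorithm 1 refers to (propagate, reweight, resample); it is not the authors' TRT-SMC, and the dynamics, reward weighting, and root-action voting scheme are all illustrative assumptions:

```python
import numpy as np

def smc_plan(root_state, step_fn, reward_fn, n_actions, K=16, m=16, seed=0):
    """Minimal bootstrapped particle filter planner (generic SMC sketch).

    K particles are propagated to depth m; at each step particles are
    reweighted by exp(reward) and multinomially resampled. The root
    action held by the most surviving particles decides the move.
    """
    rng = np.random.default_rng(seed)
    states = np.repeat(root_state[None, :], K, axis=0)
    root_actions = rng.integers(0, n_actions, size=K)  # action taken at the root
    actions = root_actions.copy()
    for _ in range(m):
        states = step_fn(states, actions, rng)          # propagate particles
        logw = reward_fn(states)                        # log-weights from reward
        w = np.exp(logw - logw.max())
        w /= w.sum()
        idx = rng.choice(K, size=K, p=w)                # bootstrap resampling
        states, root_actions = states[idx], root_actions[idx]
        actions = rng.integers(0, n_actions, size=K)    # fresh rollout actions
    return int(np.bincount(root_actions, minlength=n_actions).argmax())

# Toy 1-D chain: action 1 drifts right (+1), action 0 drifts left (-1);
# reward favours large positions, so lineages that moved right survive.
def toy_step(states, actions, rng):
    return states + (2 * actions - 1)[:, None] + 0.1 * rng.standard_normal(states.shape)

def toy_reward(states):
    return states[:, 0]

best = smc_plan(np.zeros(1), toy_step, toy_reward, n_actions=2, K=16, m=16)
```

With K = m = 16 this mirrors the budget setting quoted above; shrinking K while growing m (or vice versa) is what the paper's Subsection 4.2 ablations vary.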
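The version pins in the Software Dependencies row can be collected into a single reproducible install. The package names and versions below are taken directly from Table 2; the use of `pip` (rather than, say, a conda environment) is an assumption about how the environment was built:

```shell
# Versions quoted from Table 2; requires Python 3.12.4.
# The pip invocation itself is an assumption, not the authors' setup script.
pip install \
  "jax==0.4.30" \
  "brax==0.10.5" \
  "optax==0.2.3" \
  "flashbax==0.1.2" \
  "rlax==0.1.6" \
  "mctx==0.0.5" \
  "flax==0.8.4" \
  "jumanji==1.0.1"
```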