Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation

Authors: Eliot Xing, Vernon Luk, Jean Oh

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We re-implement challenging manipulation and locomotion tasks in Rewarped, and show that SAPO outperforms baselines over a range of tasks that involve interaction between rigid bodies, articulations, and deformables. [...] We evaluate our proposed maximum entropy FO-MBRL algorithm, Soft Analytic Policy Optimization (SAPO, Section 4), against baselines on a range of locomotion and manipulation tasks involving rigid and soft bodies. [...] In Figure 2, we visualize training curves to compare algorithms. SAPO shows better training stability across different random seeds, against existing FO-MBRL algorithms APG and SHAC. In Table 2, we report evaluation performance for final policies after training."
Researcher Affiliation | Academia | "Eliot Xing & Vernon Luk & Jean Oh, Carnegie Mellon University, EMAIL"
Pseudocode | Yes | "Pseudocode for SAPO is shown in Appendix B.2, and the computational graph of SAPO is illustrated in Appendix Figure 4. [...] Algorithm 1: Soft Analytic Policy Optimization (SAPO)"
Open Source Code | No | "Additional details at rewarped.github.io." (This link points to a project website; it neither states that it hosts source code nor links directly to a code repository.)
Open Datasets | No | The paper describes reimplemented tasks (e.g., "Ant Run: Ant locomotion task from DFlex", "Rolling Flat: rolling-pin manipulation task from Plasticine Lab"), which are simulation environments or benchmarks, not publicly available datasets with specific access information (links, DOIs, or author/year citations) in the main text.
Dataset Splits | No | The experiments are reinforcement learning runs in simulated environments, which do not use static train/validation/test splits as in supervised learning. The quoted "Mean and 95% CIs over 10 random seeds with 2N episodes per seed for N = 32 or 64 parallel envs" refers to experimental repetitions and parallel execution, not dataset splitting.
Hardware Specification | Yes | "We run all algorithms on consumer workstations with NVIDIA RTX 4090 GPUs. Each run uses a single GPU, on which we run both the GPU-accelerated parallel simulation and optimization loop. [...] We report all timings on a consumer workstation with an AMD Threadripper 5955WX CPU, NVIDIA RTX 4090 GPU, and 128GB DDR4 3200MHz RAM."
Software Dependencies | No | "We build Rewarped on NVIDIA Warp (Macklin, 2022) [...] We use a custom PyTorch autograd function to interface simulation data and model parameters between Warp and PyTorch [...]" (The paper names NVIDIA Warp and PyTorch but does not give version numbers for either, which a reproducible description requires.)
Experiment Setup | Yes | "Implementation details (network architecture, common hyperparameters, etc.) are standardized between methods for fair comparison, see Appendix C. [...] Appendix C: HYPERPARAMETERS. Table 4: Shared hyperparameters. Algorithms use hyperparameter settings in the shared column unless otherwise specified in an individual column." (The table lists specific values for num envs, batch size, horizon, mini-epochs, discount, TD/GAE lambda, learning rates, optimizer types, beta values, gradient clip, norm type, activation type, actor sigma, number of critics, critic tau, replay buffer size, target entropy, and initial temperature.)
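For reference, the "maximum entropy" objective named in the Research Type row is, in its standard textbook formulation (this equation is the generic maximum-entropy RL objective, not quoted from the paper):

```latex
J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\Big(r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big)\right]
```

Here $\gamma$ is the discount, $\alpha$ the temperature, and $\mathcal{H}$ the policy entropy; an entropy bonus of this form is what distinguishes maximum-entropy methods from plain first-order model-based policy optimization.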
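The Software Dependencies row quotes a "custom PyTorch autograd function" bridging Warp and PyTorch. A minimal sketch of that pattern is below; the simulator step is stubbed with plain tensor math (`sim_forward` and its hand-written adjoint are hypothetical stand-ins, not Rewarped's actual code, where the forward and backward passes would launch NVIDIA Warp kernels):

```python
import torch


def sim_forward(state, action):
    # Stand-in for one differentiable simulation step
    # (in Rewarped this would be a Warp kernel launch).
    return state + 0.1 * action


class SimStep(torch.autograd.Function):
    @staticmethod
    def forward(ctx, state, action):
        ctx.save_for_backward(state, action)
        # Run the step outside PyTorch autograd: the simulator
        # owns this computation and supplies its own adjoint.
        with torch.no_grad():
            next_state = sim_forward(state, action)
        return next_state

    @staticmethod
    def backward(ctx, grad_out):
        # The simulator's adjoint pass would produce these gradients;
        # for the linear stub above they are known in closed form.
        state, action = ctx.saved_tensors
        return grad_out, 0.1 * grad_out


state = torch.zeros(3, requires_grad=True)
action = torch.ones(3, requires_grad=True)
next_state = SimStep.apply(state, action)
next_state.sum().backward()
print(action.grad)  # first-order gradients flow through the simulated step
```

This is the mechanism that makes first-order model-based RL (FO-MBRL) possible: policy gradients backpropagate through the simulation step itself.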
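The Dataset Splits row quotes "Mean and 95% CIs over 10 random seeds". A common way such intervals are computed (a normal-approximation sketch with made-up per-seed returns; the paper does not specify its CI procedure) is:

```python
import math
import statistics


def mean_and_ci95(returns):
    """Mean and normal-approximation 95% CI half-width across seeds."""
    m = statistics.mean(returns)
    se = statistics.stdev(returns) / math.sqrt(len(returns))
    return m, 1.96 * se


# e.g. final returns from 10 random seeds (illustrative numbers only)
per_seed_returns = [9.1, 8.7, 9.4, 8.9, 9.0, 9.3, 8.8, 9.2, 9.0, 9.1]
mean, half_width = mean_and_ci95(per_seed_returns)
print(f"{mean:.2f} ± {half_width:.2f}")
```

Reporting spread across seeds, rather than a dataset split, is the standard evaluation protocol for simulated RL benchmarks.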
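The Experiment Setup row lists the hyperparameter names from the paper's Table 4. A sketch of how such a shared config is typically structured is below; the field names follow the quoted list, but every value is an illustrative placeholder, not the paper's actual setting:

```python
from dataclasses import dataclass


@dataclass
class SharedHyperparams:
    # Values below are placeholders for illustration only.
    num_envs: int = 64
    batch_size: int = 1024
    horizon: int = 32
    mini_epochs: int = 4
    discount: float = 0.99
    td_gae_lambda: float = 0.95
    actor_lr: float = 3e-4
    critic_lr: float = 3e-4
    grad_clip: float = 1.0
    num_critics: int = 2
    critic_tau: float = 0.005
    replay_buffer_size: int = 1_000_000
    init_temperature: float = 0.2


cfg = SharedHyperparams()
print(cfg.discount, cfg.td_gae_lambda)
```

Keeping such settings in one shared config, with per-algorithm overrides, is what the paper means by standardizing implementation details for fair comparison.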