Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation
Authors: Eliot Xing, Vernon Luk, Jean Oh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We re-implement challenging manipulation and locomotion tasks in Rewarped, and show that SAPO outperforms baselines over a range of tasks that involve interaction between rigid bodies, articulations, and deformables. [...] We evaluate our proposed maximum entropy FO-MBRL algorithm, Soft Analytic Policy Optimization (SAPO, Section 4), against baselines on a range of locomotion and manipulation tasks involving rigid and soft bodies. [...] In Figure 2, we visualize training curves to compare algorithms. SAPO shows better training stability across different random seeds, against existing FO-MBRL algorithms APG and SHAC. In Table 2, we report evaluation performance for final policies after training. |
| Researcher Affiliation | Academia | Eliot Xing & Vernon Luk & Jean Oh, Carnegie Mellon University, EMAIL |
| Pseudocode | Yes | Pseudocode for SAPO is shown in Appendix B.2, and the computational graph of SAPO is illustrated in Appendix Figure 4. [...] Algorithm 1: Soft Analytic Policy Optimization (SAPO) |
| Open Source Code | No | Additional details at rewarped.github.io. (This link is to a project website and does not explicitly state it contains source code, nor is it a direct link to a code repository.) |
| Open Datasets | No | The paper describes reimplemented tasks (e.g., "Ant Run Ant locomotion task from DFlex", "Rolling Flat Rolling pin manipulation task from Plasticine Lab"), which are simulation environments or benchmarks, not publicly available datasets with specific access information (links, DOIs, or citations with authors/year) in the main text. |
| Dataset Splits | No | The paper conducts experiments in simulated environments for reinforcement learning, which typically do not involve static training/testing/validation dataset splits in the same way as supervised learning. The text mentions "Mean and 95% CIs over 10 random seeds with 2N episodes per seed for N = 32 or 64 parallel envs," which refers to experimental repetitions and parallel execution, not dataset splitting. |
| Hardware Specification | Yes | We run all algorithms on consumer workstations with NVIDIA RTX 4090 GPUs. Each run uses a single GPU, on which we run both the GPU-accelerated parallel simulation and optimization loop. [...] We report all timings on a consumer workstation with an AMD Threadripper 5955WX CPU, NVIDIA RTX 4090 GPU, and 128GB DDR4 3200MHz RAM. |
| Software Dependencies | No | We build Rewarped on NVIDIA Warp (Macklin, 2022) [...] We use a custom PyTorch autograd function to interface simulation data and model parameters between Warp and PyTorch [...]. (The paper mentions NVIDIA Warp and PyTorch but does not provide specific version numbers for these software components, which is required for a reproducible description.) |
| Experiment Setup | Yes | Implementation details (network architecture, common hyperparameters, etc.) are standardized between methods for fair comparison, see Appendix C. [...] Appendix C: HYPERPARAMETERS. Table 4: Shared hyperparameters. Algorithms use hyperparameter settings in the shared column unless otherwise specified in an individual column. (This table lists specific values for Num envs, Batch size, Horizon, Mini-epochs, Discount, TD/GAE lambda, learning rates, optimizer types, beta values, gradient clip, norm type, activation type, actor sigma, num critics, critic tau, replay buffer size, target entropy, and initial temperature.) |
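The Software Dependencies row quotes the paper's "custom PyTorch autograd function to interface simulation data and model parameters between Warp and PyTorch." A minimal sketch of that bridging pattern, with a toy one-step dynamics function standing in for a Warp simulation kernel (the class name, dynamics, and gradients below are illustrative, not the authors' implementation):

```python
import torch


class SimStep(torch.autograd.Function):
    """Illustrative Warp<->PyTorch bridge. A real bridge would launch a
    Warp kernel on shared device memory in forward() and replay Warp's
    tape in backward(); here, plain tensor math plays the simulator."""

    @staticmethod
    def forward(ctx, state, action):
        # Toy dynamics: x' = x + dt * a, with dt = 0.1.
        ctx.save_for_backward(state, action)
        return state + 0.1 * action

    @staticmethod
    def backward(ctx, grad_out):
        # Adjoints of the toy dynamics w.r.t. (state, action); a real
        # bridge would obtain these from the external simulator.
        state, action = ctx.saved_tensors
        return grad_out.clone(), 0.1 * grad_out


state = torch.zeros(3, requires_grad=True)
action = torch.ones(3, requires_grad=True)
next_state = SimStep.apply(state, action)
next_state.sum().backward()
print(action.grad)  # each entry equals dt = 0.1
```

This is the standard `torch.autograd.Function` extension point; first-order model-based RL methods like the quoted SAPO rely on exactly this kind of gradient pass-through to backpropagate policy losses through simulation steps.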
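The Dataset Splits row quotes the paper's reporting convention, "Mean and 95% CIs over 10 random seeds." A small sketch of that aggregation, using a normal-approximation interval on hypothetical per-seed returns (the paper does not state its exact CI method, so a t-interval or bootstrap could differ slightly):

```python
import math
import statistics


def mean_ci95(per_seed_returns):
    """Mean and 95% CI half-width (normal approximation) over seeds."""
    m = statistics.mean(per_seed_returns)
    sem = statistics.stdev(per_seed_returns) / math.sqrt(len(per_seed_returns))
    return m, 1.96 * sem


# Hypothetical final returns from 10 seeds (not the paper's numbers).
returns = [9.1, 8.7, 9.4, 9.0, 8.9, 9.2, 8.8, 9.3, 9.0, 9.1]
mean, half_width = mean_ci95(returns)
print(f"{mean:.2f} +/- {half_width:.2f}")
```

This matches how RL results are commonly summarized when seed-to-seed variance, rather than a held-out dataset split, is the relevant source of uncertainty.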