ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization
Authors: Chen Bo Calvin Zhang, Zhang-Wei Hong, Aldo Pacchiano, Pulkit Agrawal
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate ORSO's effectiveness across various continuous control tasks. Compared to prior approaches, ORSO significantly reduces the amount of data required to evaluate a shaping reward function, resulting in superior data efficiency and a significant reduction in computational time (up to 8×). ORSO consistently identifies high-quality reward functions, outperforming prior methods by more than 50%, and on average identifies policies as performant as the ones learned using manually engineered reward functions by domain experts. (Section 5, Practical Implementation and Experimental Results: "In this section, we present a practical implementation of ORSO and its experimental results on several continuous control tasks.") |
| Researcher Affiliation | Academia | Chen Bo Calvin Zhang, Zhang-Wei Hong, Aldo Pacchiano, Pulkit Agrawal. Affiliations: Improbable AI Lab, Massachusetts Institute of Technology; ETH Zurich; Boston University; Broad Institute of MIT and Harvard. Correspondence to EMAIL, EMAIL. |
| Pseudocode | Yes | Algorithm 1: ORSO: Online Reward Selection and Policy Optimization; Algorithm 2: ORSO with D3RB; Algorithm 3: Rejection Sampling in ORSO; Algorithm 4: ORSO with Rejection Sampling and Iterative Improvement |
| Open Source Code | Yes | Code is available at https://github.com/Improbable-AI/orso |
| Open Datasets | No | The paper uses the Isaac Gym simulator to generate environments and data, rather than using pre-existing datasets. While it mentions environments and reward functions, it does not provide concrete access information for any specific, publicly available dataset used for evaluation. |
| Dataset Splits | No | The paper evaluates performance in simulation environments (Isaac Gym tasks) rather than on pre-collected, static datasets. Therefore, it does not specify explicit training/test/validation splits for a dataset. It discusses 'interaction budgets' but this refers to computational resources for policy training, not dataset partitioning. |
| Hardware Specification | No | The paper mentions 'Isaac Gym: High performance GPU based physics simulation' and discusses 'number of parallel GPUs' in Figure 3, as well as acknowledging 'MIT Supercloud and the Lincoln Laboratory Supercomputing Center for providing HPC resources'. However, it does not specify any particular models of GPUs, CPUs, or other hardware components used for the experiments. |
| Software Dependencies | No | The paper mentions several software components like 'proximal policy optimization (PPO) algorithm', 'Clean RL (Huang et al., 2022)', 'GPT-4 (Achiam et al., 2023)', and the 'Isaac Gym simulator (Makoviychuk et al., 2021)'. However, it does not provide specific version numbers for these software dependencies, which are necessary for full reproducibility. |
| Experiment Setup | Yes | For every environment, we set the number of iterations N in Algorithm 1 used to train the policy before we select a different reward function to N = n_iters/100, where n_iters is the number of iterations used to train the baselines, i.e., we perform at least 100 iterations of online reward selection before the iterative resampling. We consider interaction budgets B ∈ {5, 10, 15} · n_iters and sample sizes K ∈ {4, 8, 16}. We provide the pseudocode and the hyperparameters used for each selection algorithm in Appendix G. |
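The setup row above fixes the selection cadence (N = n_iters/100 training iterations per selection round), the interaction budget B, and the number of candidate reward functions K. A minimal, illustrative sketch of such an online reward-selection loop is below. All names and signatures (`orso_sketch`, `train_policy`, `evaluate`) are hypothetical, and the UCB-style selector is a stand-in assumption for exposition only; the paper's Algorithm 2 uses D3RB for selection.

```python
import math


def orso_sketch(reward_fns, train_policy, evaluate, budget, n_iters):
    """Illustrative ORSO-style online reward selection loop (not the paper's exact algorithm).

    reward_fns   : list of K candidate shaping reward functions
    train_policy : callable (reward_fn, n) -> None; trains the policy for n iterations
    evaluate     : callable () -> float; task-performance estimate of the current policy
    budget       : total interaction budget B (in training iterations)
    n_iters      : iteration count used to train the baselines
    Returns the index of the best-scoring reward function.
    """
    K = len(reward_fns)
    N = max(1, n_iters // 100)  # iterations per selection round, as in the setup row
    counts = [0] * K            # times each reward function was selected
    values = [0.0] * K          # running mean of observed evaluation returns
    t = 0
    used = 0
    while used < budget:
        t += 1
        # UCB-style scores: unexplored arms first, then value + exploration bonus.
        scores = [
            values[i] + math.sqrt(2 * math.log(t) / counts[i])
            if counts[i] > 0 else float("inf")
            for i in range(K)
        ]
        i = max(range(K), key=lambda j: scores[j])
        train_policy(reward_fns[i], N)   # spend N iterations under the chosen reward
        used += N
        ret = evaluate()
        counts[i] += 1
        values[i] += (ret - values[i]) / counts[i]  # incremental mean update
    return max(range(K), key=lambda j: values[j])
```

With deterministic stand-in training and evaluation functions, the loop concentrates its budget on the reward function that yields the highest evaluated return, which is the data-efficiency mechanism the report's Research Type row describes.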