ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization

Authors: Chen Bo Calvin Zhang, Zhang-Wei Hong, Aldo Pacchiano, Pulkit Agrawal

ICLR 2025

Reproducibility Checklist (Variable / Result / LLM Response)
Research Type Experimental We demonstrate ORSO's effectiveness across various continuous control tasks. Compared to prior approaches, ORSO significantly reduces the amount of data required to evaluate a shaping reward function, resulting in superior data efficiency and a significant reduction in computational time (up to 8×). ORSO consistently identifies high-quality reward functions, outperforming prior methods by more than 50%, and on average identifies policies as performant as those learned using reward functions manually engineered by domain experts. 5 PRACTICAL IMPLEMENTATION AND EXPERIMENTAL RESULTS: In this section, we present a practical implementation of ORSO and its experimental results on several continuous control tasks.
Researcher Affiliation Academia Chen Bo Calvin Zhang, Zhang-Wei Hong, Aldo Pacchiano, Pulkit Agrawal. Improbable AI Lab, Massachusetts Institute of Technology; ETH Zurich; Boston University; Broad Institute of MIT and Harvard. Correspondence to EMAIL, EMAIL.
Pseudocode Yes Algorithm 1 ORSO: Online Reward Selection and Policy Optimization Algorithm 2 ORSO with D3RB Algorithm 3 Rejection Sampling in ORSO Algorithm 4 ORSO with Rejection Sampling and Iterative Improvement
Open Source Code Yes Code is available at https://github.com/Improbable-AI/orso.
Open Datasets No The paper uses the Isaac Gym simulator to generate environments and data, rather than using pre-existing datasets. While it mentions environments and reward functions, it does not provide concrete access information for any specific, publicly available dataset used for evaluation.
Dataset Splits No The paper evaluates performance in simulation environments (Isaac Gym tasks) rather than on pre-collected, static datasets. Therefore, it does not specify explicit training/test/validation splits for a dataset. It discusses 'interaction budgets' but this refers to computational resources for policy training, not dataset partitioning.
Hardware Specification No The paper mentions 'Isaac Gym: High performance GPU based physics simulation' and discusses 'number of parallel GPUs' in Figure 3, as well as acknowledging 'MIT Supercloud and the Lincoln Laboratory Supercomputing Center for providing HPC resources'. However, it does not specify any particular models of GPUs, CPUs, or other hardware components used for the experiments.
Software Dependencies No The paper mentions several software components like 'proximal policy optimization (PPO) algorithm', 'Clean RL (Huang et al., 2022)', 'GPT-4 (Achiam et al., 2023)', and the 'Isaac Gym simulator (Makoviychuk et al., 2021)'. However, it does not provide specific version numbers for these software dependencies, which are necessary for full reproducibility.
Experiment Setup Yes For every environment, we set the number of iterations N in Algorithm 1 used to train the policy before we select a different reward function to N = n_iters/100, where n_iters is the number of iterations used to train the baselines, i.e., we perform at least 100 iterations of online reward selection before the iterative resampling. We consider interaction budgets B ∈ {5, 10, 15} × n_iters and sample sizes K ∈ {4, 8, 16}. We provide the pseudocode and the hyperparameters used for each selection algorithm in Appendix G.
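To make the reported setup concrete, here is a minimal, hypothetical sketch of an ORSO-style outer loop with those hyperparameters: K candidate reward functions, N = n_iters/100 training iterations per selection round, and a total interaction budget B. The paper's actual selection algorithm (e.g., D3RB, Algorithm 2) is replaced here by a simple epsilon-greedy stand-in, and `train_policy` is a placeholder callback, not the authors' API.

```python
import random

def orso_sketch(reward_fns, train_policy, n_iters=1000, budget_multiplier=5, seed=0):
    """Hypothetical ORSO-style loop: repeatedly pick a shaping reward,
    train the policy for N iterations under it, and update a running
    estimate of that reward's policy performance."""
    rng = random.Random(seed)
    K = len(reward_fns)                   # candidate rewards; paper uses K in {4, 8, 16}
    N = max(1, n_iters // 100)            # iterations per selection round, N = n_iters / 100
    budget = budget_multiplier * n_iters  # interaction budget B in {5, 10, 15} * n_iters
    scores = [0.0] * K                    # running mean return under each reward
    counts = [0] * K
    used = 0
    while used + N <= budget:
        # epsilon-greedy selection over reward functions (stand-in for D3RB)
        if 0 in counts:
            k = counts.index(0)           # try every candidate at least once
        elif rng.random() < 0.1:
            k = rng.randrange(K)          # explore
        else:
            k = max(range(K), key=lambda i: scores[i])  # exploit current best
        ret = train_policy(reward_fns[k], N)  # train for N iterations, get eval return
        counts[k] += 1
        scores[k] += (ret - scores[k]) / counts[k]  # incremental mean update
        used += N
    best = max(range(K), key=lambda i: scores[i])
    return best, scores
```

With deterministic placeholder returns, the loop simply converges on the candidate with the highest evaluation return; the point of the sketch is only the accounting of N-iteration rounds against the budget B.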