ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization

Authors: Chen Bo Calvin Zhang, Zhang-Wei Hong, Aldo Pacchiano, Pulkit Agrawal

ICLR 2025

Reproducibility Checklist (Variable / Result / LLM Response)
Research Type Experimental We demonstrate ORSO's effectiveness across various continuous control tasks. Compared to prior approaches, ORSO significantly reduces the amount of data required to evaluate a shaping reward function, resulting in superior data efficiency and a significant reduction in computational time (up to 8×). ORSO consistently identifies high-quality reward functions, outperforming prior methods by more than 50%, and on average identifies policies as performant as those learned using reward functions manually engineered by domain experts. 5 PRACTICAL IMPLEMENTATION AND EXPERIMENTAL RESULTS: In this section, we present a practical implementation of ORSO and its experimental results on several continuous control tasks.
Researcher Affiliation Academia Chen Bo Calvin Zhang, Zhang-Wei Hong, Aldo Pacchiano, Pulkit Agrawal. Improbable AI Lab, Massachusetts Institute of Technology; ETH Zurich; Boston University; Broad Institute of MIT and Harvard. Correspondence to EMAIL, EMAIL.
Pseudocode Yes Algorithm 1 ORSO: Online Reward Selection and Policy Optimization Algorithm 2 ORSO with D3RB Algorithm 3 Rejection Sampling in ORSO Algorithm 4 ORSO with Rejection Sampling and Iterative Improvement
Open Source Code Yes Code is available at https://github.com/Improbable-AI/orso.
Open Datasets No The paper uses the Isaac Gym simulator to generate environments and data, rather than using pre-existing datasets. While it mentions environments and reward functions, it does not provide concrete access information for any specific, publicly available dataset used for evaluation.
Dataset Splits No The paper evaluates performance in simulation environments (Isaac Gym tasks) rather than on pre-collected, static datasets. Therefore, it does not specify explicit training/test/validation splits for a dataset. It discusses 'interaction budgets' but this refers to computational resources for policy training, not dataset partitioning.
Hardware Specification No The paper mentions 'Isaac Gym: High performance GPU based physics simulation' and discusses 'number of parallel GPUs' in Figure 3, as well as acknowledging 'MIT Supercloud and the Lincoln Laboratory Supercomputing Center for providing HPC resources'. However, it does not specify any particular models of GPUs, CPUs, or other hardware components used for the experiments.
Software Dependencies No The paper mentions several software components like 'proximal policy optimization (PPO) algorithm', 'Clean RL (Huang et al., 2022)', 'GPT-4 (Achiam et al., 2023)', and the 'Isaac Gym simulator (Makoviychuk et al., 2021)'. However, it does not provide specific version numbers for these software dependencies, which are necessary for full reproducibility.
Experiment Setup Yes For every environment, we set the number of iterations N in Algorithm 1 used to train the policy before we select a different reward function to N = n_iters/100, where n_iters is the number of iterations used to train the baselines, i.e., we perform at least 100 iterations of online reward selection before the iterative resampling. We consider interaction budgets B ∈ {5, 10, 15} × n_iters and sample sizes K ∈ {4, 8, 16}. We provide the pseudocode and the hyperparameters used for each selection algorithm in Appendix G.
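To make the reported setup concrete, here is a minimal, hypothetical sketch of an ORSO-style outer loop with those hyperparameters: K candidate reward functions, N = n_iters/100 training iterations per selection round, and a total interaction budget B. The paper's actual selection algorithm (e.g., D3RB, Algorithm 2) is replaced here by a simple epsilon-greedy stand-in, and `train_policy` is a placeholder callback, not the authors' API.

```python
import random

def orso_sketch(reward_fns, train_policy, n_iters=1000, budget_multiplier=5, seed=0):
    """Hypothetical ORSO-style loop: repeatedly pick a shaping reward,
    train the policy for N iterations under it, and update a running
    estimate of that reward's policy performance."""
    rng = random.Random(seed)
    K = len(reward_fns)                   # candidate rewards; paper uses K in {4, 8, 16}
    N = max(1, n_iters // 100)            # iterations per selection round, N = n_iters / 100
    budget = budget_multiplier * n_iters  # interaction budget B in {5, 10, 15} * n_iters
    scores = [0.0] * K                    # running mean return under each reward
    counts = [0] * K
    used = 0
    while used + N <= budget:
        # epsilon-greedy selection over reward functions (stand-in for D3RB)
        if 0 in counts:
            k = counts.index(0)           # try every candidate at least once
        elif rng.random() < 0.1:
            k = rng.randrange(K)          # explore
        else:
            k = max(range(K), key=lambda i: scores[i])  # exploit current best
        ret = train_policy(reward_fns[k], N)  # train for N iterations, get eval return
        counts[k] += 1
        scores[k] += (ret - scores[k]) / counts[k]  # incremental mean update
        used += N
    best = max(range(K), key=lambda i: scores[i])
    return best, scores
```

With deterministic placeholder returns, the loop simply converges on the candidate with the highest evaluation return; the point of the sketch is only the accounting of N-iteration rounds against the budget B.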