Generalized Behavior Learning from Diverse Demonstrations
Authors: Varshith Sreeramdass, Rohan Paleja, Letian Chen, Sanne van Waveren, Matthew Gombolay
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate, across three continuous control benchmarks and under both in-distribution (interpolation) and out-of-distribution (extrapolation) factors, that GSD outperforms baselines in novel behavior discovery by 21%. Finally, we demonstrate that GSD can generalize striking behaviors for table tennis in a virtual testbed while leveraging human demonstrations collected in the real world. |
| Researcher Affiliation | Academia | Varshith Sreeramdass, Rohan Paleja, Letian Chen, Sanne van Waveren, Matthew Gombolay; Georgia Institute of Technology; EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Guided Strategy Discovery |
| Open Source Code | Yes | Code is available at github.com/CORE-Robotics-Lab/GSD. |
| Open Datasets | Yes | The Half Cheetah environment considered in Sec. 6 is from OpenAI Gym (Brockman et al., 2016). The Fetch Pick Place environment considered is from the gym library (Brockman et al., 2016). The Drive Laneshift environment is built from the highway-env library (Leurent, 2018). |
| Dataset Splits | Yes | Splits: We divide the bounded 1D factor range into five consecutive equal-sized intervals: Interpolation: The first, third, and fifth intervals represent the train region, and the second and fourth are the test region. The split allows us to evaluate the ability to interpolate behaviors to two factor space intervals while providing three non-consecutive intervals to represent the factor. Extrapolation: The second and fourth intervals represent the train region, while the first and fifth intervals are the test region. We choose two non-consecutive intervals for the train region to have a sparse dataset while providing enough diversity to represent the factor. We use five demonstrations per interval (details in Appendix B). |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used for running the experiments. While a 'Barrett WAM Arm' is mentioned for the Table Tennis setup, this refers to the robotic hardware for physical demonstrations/simulation, not the computing hardware for training models. |
| Software Dependencies | No | The paper mentions 'PyTorch (Imambi et al., 2021)' but does not specify a version number for PyTorch or any other key software libraries used in the implementation. |
| Experiment Setup | Yes | The hyperparameters used in our optimization are listed in Tables 1, 2. Each method is independently tuned for λI (and λC for Lipz, GSD) over the specified ranges, to maximize MAE over the test split for K=10, averaged over four rounds of evaluation and five train seeds. All hyperparameters omitted from the tables are set to default values from our base implementation. |
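The interpolation/extrapolation splits quoted above (five equal intervals over a bounded 1D factor range, with alternating train/test assignments) can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and signature are hypothetical.

```python
def split_factor_range(low, high, mode="interpolation"):
    """Divide a bounded 1D factor range [low, high] into five
    consecutive equal-sized intervals and assign train/test regions.

    interpolation: intervals 1, 3, 5 -> train; 2, 4 -> test
    extrapolation: intervals 2, 4 -> train; 1, 5 -> test
    """
    # Six edges delimit five equal-sized intervals.
    edges = [low + (high - low) * i / 5 for i in range(6)]
    intervals = [(edges[i], edges[i + 1]) for i in range(5)]

    if mode == "interpolation":
        train_idx, test_idx = [0, 2, 4], [1, 3]
    elif mode == "extrapolation":
        train_idx, test_idx = [1, 3], [0, 4]
    else:
        raise ValueError(f"unknown mode: {mode}")

    train = [intervals[i] for i in train_idx]
    test = [intervals[i] for i in test_idx]
    return train, test
```

Under the extrapolation split, the test region consists of the outermost intervals, so evaluation requires producing behaviors for factor values outside the convex hull of the training data.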