ManiSkill-HAB: A Benchmark for Low-Level Manipulation in Home Rearrangement Tasks

Authors: Arth Shukla, Stone Tao, Hao Su

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train extensive reinforcement learning (RL) and imitation learning (IL) baselines for future work to compare against. Leveraging our fast environments, we run extensive RL baselines, training 150 policies across 3 seeds (50 policies/seed) with 1.83 billion environment samples.
Researcher Affiliation | Collaboration | Arth Shukla, Stone Tao & Hao Su; Hillbot Inc. and University of California, San Diego; EMAIL
Pseudocode | No | The paper describes methods and processes textually and through mathematical formulations, but it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks, figures, or sections.
Open Source Code | Yes | Videos, models, data, code, and more at http://arth-shukla.github.io/mshab
Open Datasets | Yes | The Home Assistant Benchmark (HAB) (Szot et al., 2021) includes three long-horizon tasks which involve rearranging objects from the YCB dataset (Calli et al., 2015). The ReplicaCAD dataset (Szot et al., 2021) serves as the source for our apartment scenes.
Dataset Splits | Yes | The dataset is split into three parts: 3 macro-variations for training, 1 for validation, and 1 for testing. However, as the test split is not publicly accessible, our study utilizes only the train and validation splits. Furthermore, for each long-horizon task, HAB provides 10,000 training episode configurations and 1,000 validation configurations.
Hardware Specification | Yes | Our benchmarking is conducted on a machine equipped with a 16-core/32-thread Intel i9-12900KS processor and an Nvidia RTX 4090 GPU with 24 GB VRAM.
Software Dependencies | No | The paper mentions several algorithms and frameworks like SAC, PPO, D4PG, Nature CNN, and ManiSkill3, but it does not specify any version numbers for these software components or libraries.
Experiment Setup | Yes | We stack 3 consecutive frames for image observations to handle partial observability. We train Pick and Place using SAC (Haarnoja et al., 2018; Xing, 2022) with a 1M replay buffer size. Visual observations are encoded by D4PG's 4-layer CNN (Barth-Maron et al., 2018) and concatenated with state observations. Actor and critic networks are 3-layer MLPs, and the critic uses LayerNorm to avoid value divergence (Ball et al., 2023). We train Pick for 50M timesteps and Place for 25M timesteps. We train Open and Close using PPO (Schulman et al., 2017; Huang et al., 2022). Visual observations are encoded by a Nature CNN (Mnih et al., 2015) and concatenated with state observations. The actor and critic networks are 2-layer MLPs. We train Open Fridge for 15M timesteps, Open Drawer for 50M timesteps, Close Fridge for 25M timesteps, and Close Drawer for 15M timesteps. We train 3 seeds for each task/subtask/object combination, evaluating on 189 episodes every 100,000 train samples. We select the checkpoint with the highest evaluation success rate as our final policy.
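The 3-frame observation stacking mentioned in the experiment setup can be sketched as a small wrapper; this is a minimal illustration in NumPy, not code from the released MS-HAB repository, and the class and parameter names are hypothetical.

```python
# Minimal sketch of k-frame observation stacking for partial observability.
# On reset, the buffer is filled with copies of the first frame so the
# stacked shape stays constant; each step appends the newest frame and
# drops the oldest.
from collections import deque
import numpy as np

class FrameStack:
    def __init__(self, k=3):
        self.k = k
        self.frames = deque(maxlen=k)  # oldest frame is evicted automatically

    def reset(self, obs):
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(obs)
        return np.concatenate(self.frames, axis=0)  # stack along channel axis

    def step(self, obs):
        self.frames.append(obs)
        return np.concatenate(self.frames, axis=0)

obs = np.zeros((3, 64, 64), dtype=np.float32)  # a single C,H,W image
stack = FrameStack(k=3)
stacked = stack.reset(obs)
assert stacked.shape == (9, 64, 64)  # 3 frames x 3 channels
```

Stacking along the channel axis lets the same CNN encoder consume the history without any architectural change.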
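The checkpoint-selection rule (evaluate on 189 episodes every 100,000 train samples, keep the checkpoint with the highest evaluation success rate) amounts to an argmax over the evaluation log. A hypothetical sketch, with an invented function name and log format:

```python
# Hypothetical sketch of the checkpoint-selection rule: keep the
# checkpoint whose periodic evaluation achieved the highest success rate.
def select_final_checkpoint(eval_log):
    """eval_log: list of (train_step, success_rate) pairs."""
    return max(eval_log, key=lambda entry: entry[1])

log = [(100_000, 0.12), (200_000, 0.47), (300_000, 0.45)]
best_step, best_sr = select_final_checkpoint(log)
# best_step == 200_000, best_sr == 0.47
```

Note that `max` returns the first maximal entry, so ties are broken in favor of the earlier checkpoint.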