Subtask-Aware Visual Reward Learning from Segmented Demonstrations

Authors: Changyeon Kim, Minho Heo, Doohyun Lee, Honglak Lee, Jinwoo Shin, Joseph Lim, Kimin Lee

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | Our experiments show that REDS significantly outperforms baseline methods on complex robotic manipulation tasks in Meta-World and more challenging real-world tasks, such as furniture assembly in FurnitureBench, with minimal human intervention. Moreover, REDS facilitates generalization to unseen tasks and robot embodiments, highlighting its potential for scalable deployment in diverse environments.
Researcher Affiliation | Collaboration | KAIST; University of Michigan; LG AI Research
Pseudocode | No | The paper describes the training and inference procedures in Section 4.4 using numbered steps (Step 1, Step 2, Step 3) within paragraph text, but it does not present these as structured pseudocode or an algorithm block explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | Project page: https://changyeon.site/reds. In addition, to further facilitate reproduction, we release an open-source implementation through the project website.
Open Datasets | Yes | We conduct extensive experiments on robotic manipulation tasks from Meta-World (Yu et al., 2020) (see Section 5.1) in simulation and robotic furniture assembly tasks from FurnitureBench (Heo et al., 2023) (see Section 5.2) in the real world. The paper also uses RLBench (James et al., 2020).
Dataset Splits | Yes | For training REDS, we first collect subtask segmentations from 50 expert demonstrations for initial training, then train DreamerV3 agents for 100K environment steps with the initial reward model to collect suboptimal trajectories, which are used for fine-tuning. We use 300 expert demonstrations with subtask segmentations provided by FurnitureBench, along with an additional 200 rollouts from an IQL (Kostrikov et al., 2022) policy trained on expert demonstrations in a single training iteration.
Hardware Specification | Yes | We use 24 Intel Xeon 2.2 GHz CPU cores and 4 NVIDIA RTX 3090 GPUs for training our reward model, which takes about 1.5 hours in Meta-World and 3 hours in FurnitureBench due to high-resolution visual observations from multiple views. For training DreamerV3 agents in Meta-World, we use 24 Intel Xeon 2.2 GHz CPU cores and a single NVIDIA RTX 3090 GPU, which takes approximately 4 hours over 500K environment steps. For training IQL agents in FurnitureBench, we use 24 Intel Xeon 2.2 GHz CPU cores and a single NVIDIA RTX 3090 GPU, taking approximately 2 hours for 1M gradient steps in offline RL and 4.5 hours over 150 episodes of environment interaction in online RL.
Software Dependencies | No | The paper mentions several software components and models, such as CLIP (Radford et al., 2021a) with the ViT-B/16 architecture, the GPT (Radford et al., 2018) architecture, the AdamW (Loshchilov & Hutter, 2019) optimizer, DreamerV3 (Hafner et al., 2023), and IQL (Kostrikov et al., 2022). However, it does not provide specific version numbers for general software libraries or frameworks (e.g., Python, PyTorch, TensorFlow) beyond the model architectures themselves.
Experiment Setup | Yes | All models are trained with the AdamW (Loshchilov & Hutter, 2019) optimizer with a learning rate of 1 × 10^-4 and a mini-batch size of 32. To ensure robustness to visual distractions, we apply color jittering and random shift (Yarats et al., 2021) to visual observations when training REDS. Please refer to Appendix A for more details. We report the hyperparameters used in our experiments in Table 3.
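The random-shift augmentation cited above (Yarats et al., 2021) is commonly implemented as replicate-padding followed by a random crop back to the original size. The paper's own code is on its project page; the snippet below is only a minimal, dependency-free sketch of that pad-and-crop idea, where the function name `random_shift`, the 2D-list image format, and the default `pad=4` are illustrative assumptions rather than details from the paper.

```python
import random

def random_shift(img, pad=4):
    """Sketch of random-shift augmentation (pad-and-crop style):
    replicate-pad the image by `pad` pixels on every side, then take
    a random crop of the original size. `img` is a 2D list (H x W)
    here for simplicity; real implementations operate on batched
    C x H x W image tensors.
    """
    h, w = len(img), len(img[0])
    # Replicate edge pixels horizontally, then replicate rows vertically.
    padded_rows = [[row[0]] * pad + row + [row[-1]] * pad for row in img]
    padded = [padded_rows[0]] * pad + padded_rows + [padded_rows[-1]] * pad
    # Choose a random top-left corner and crop back to (h, w).
    top = random.randint(0, 2 * pad)
    left = random.randint(0, 2 * pad)
    return [r[left:left + w] for r in padded[top:top + h]]
```

With `pad=0` the crop offsets are forced to zero and the input is returned unchanged, which is a convenient sanity check for the implementation.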