GROOT-2: Weakly Supervised Multimodal Instruction Following Agents

Authors: Shaofei Cai, Bowei Zhang, Zihao Wang, Haowei Lin, Xiaojian Ma, Anji Liu, Yitao Liang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental GROOT-2's effectiveness is validated across four diverse environments, ranging from video games to robotic manipulation, demonstrating its robust multimodal instruction-following capabilities. We conduct experiments across four types of representative environments: classical 2D game-playing benchmarks on Atari (Bellemare et al., 2013), 3D open-world game-playing benchmarks on Minecraft (Johnson et al., 2016; Lin et al., 2023), and robotics benchmarks on the Language Table simulator (Lynch et al., 2023) and the SimplerEnv simulator (Li et al., 2024).
Researcher Affiliation Academia 1Institute for Artificial Intelligence, Peking University 2School of Intelligence Science and Technology, Peking University 3School of Electronics Engineering and Computer Science, Peking University 4Computer Science Department, University of California, Los Angeles 5Beijing Institute for General Artificial Intelligence (BIGAI)
Pseudocode No The paper describes the methodology using mathematical equations and text, but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code No The paper does not contain an explicit statement about releasing the source code for the methodology described, nor does it provide a direct link to a code repository. It mentions accessing dataset via a GitHub link (https://github.com/takuseno/d4rl-atari) and Open X dataset, but these are for external resources, not their own implementation code.
Open Datasets Yes Our approach is both general and flexible, as demonstrated through evaluations across four diverse environments ranging from video games to robotic manipulation, including Atari Games (Bellemare et al., 2013), Minecraft (Johnson et al., 2016), Language Table (Lynch et al., 2023), and SimplerEnv (Li et al., 2024). Results on the Open-World Minecraft Benchmark: to evaluate policy models in Minecraft, we used the contractor dataset from Baker et al. (2022). Results on the Language Table benchmark: we utilize a dataset provided by Lynch et al. (2023) comprising 100M trajectories. Results on the SimplerEnv Benchmark: GROOT-2 is trained on the Open X dataset (Collaboration et al., 2023). Can GROOT-2 Follow Instructions Beyond Language and Video, Like Episode Returns? Datasets from Agarwal et al. (2020), containing approximately 10M frames per game, were used.
Dataset Splits Yes To evaluate policy models in Minecraft, we used the contractor dataset from Baker et al. (2022), containing 160M frames. According to the meta information, labeled trajectories account for approximately 35% of the total data. We removed the text labels from half of the trajectories in the dataset, creating a 1:1 ratio of labeled to unlabeled trajectories. We erased the text labels from half of the dataset's trajectories, achieving a 1:1 balance between labeled and unlabeled data. For training, we constructed a dataset with 30% labeled trajectories (returns) and 70% unlabeled data. We trained four GROOT-2 variants with 0%, 25%, 50%, and 100% unlabeled data in Minecraft. At low labeled data proportions (0% to 25%), the success rate rapidly increased from 10% to 65%, indicating that labeled data significantly influences model performance. However, as the labeled data proportion increased to 50% to 80%, the success rate plateaued, rising slightly from 82% to 83%.
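The label-erasure procedure quoted above (removing text labels from half of the trajectories to obtain a 1:1 labeled/unlabeled mix) can be sketched as follows. This is an illustrative simplification, not the authors' released code; the `text_label` field and `make_weak_supervision_split` helper are hypothetical names.

```python
import random

def make_weak_supervision_split(trajectories, labeled_fraction=0.5, seed=0):
    """Erase text labels from a fraction of trajectories to create a
    mixed labeled/unlabeled training set (e.g., a 1:1 ratio)."""
    rng = random.Random(seed)
    trajs = [dict(t) for t in trajectories]  # shallow copies, source untouched
    rng.shuffle(trajs)
    n_unlabeled = int(len(trajs) * (1.0 - labeled_fraction))
    for t in trajs[:n_unlabeled]:
        t["text_label"] = None  # trajectory is now treated as unlabeled
    return trajs

# Example: 100 toy trajectories, half of which lose their labels.
data = [{"frames": [], "text_label": f"task_{i}"} for i in range(100)]
mixed = make_weak_supervision_split(data, labeled_fraction=0.5)
n_labeled = sum(t["text_label"] is not None for t in mixed)
```

The same helper covers the 30%/70% returns split by passing `labeled_fraction=0.3`.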
Hardware Specification Yes Table 6: Hyperparameters for training GROOT-2. ... Type of GPUs: NVIDIA A800 ... Parallel GPUs: 8
Software Dependencies No The paper mentions the use of simulators and benchmark environments (Atari, Minecraft, Language Table, SimplerEnv), but it does not list the software dependencies or version numbers needed to reproduce the implementation.
Experiment Setup Yes Table 6: Hyperparameters for training GROOT-2. Optimizer: AdamW; Weight Decay: 0.001; Learning Rate: 0.0000181; Warmup Steps: 2000; Number of Workers: 4; Parallel Strategy: DDP; Type of GPUs: NVIDIA A800; Parallel GPUs: 8; Accumulate Gradient Batches: 1; Batch Size/GPU (Total): 16 (128); Training Precision: bf16; Input Image Size: 224×224; Visual Backbone: ViT/32; Encoder Transformer: minGPT (w/o causal mask); Decoder Transformer: Transformer-XL; Number of Encoder Blocks: 8; Number of Decoder Blocks: 4; Hidden Dimension: 1024; Trajectory Chunk Size: 128; Attention Memory Size: 256; β1: 0.1; β2: 0.1.
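For convenience, the Table 6 hyperparameters can be transcribed into a plain config dict, which also makes the reported total batch size (128) checkable as per-GPU batch × GPUs × gradient-accumulation steps. This is a reading aid, not the authors' training code; the key names are assumptions.

```python
# Table 6 hyperparameters transcribed as a plain config dict
# (a convenience sketch; GROOT-2's actual training code is not released).
config = {
    "optimizer": "AdamW",
    "weight_decay": 0.001,
    "learning_rate": 1.81e-5,          # 0.0000181 in the table
    "warmup_steps": 2000,
    "num_workers": 4,
    "parallel_strategy": "ddp",
    "gpus": 8,                         # NVIDIA A800
    "accumulate_grad_batches": 1,
    "batch_size_per_gpu": 16,
    "precision": "bf16",
    "image_size": (224, 224),
    "visual_backbone": "ViT/32",
    "encoder": "minGPT (w/o causal mask)",
    "decoder": "Transformer-XL",
    "encoder_blocks": 8,
    "decoder_blocks": 4,
    "hidden_dim": 1024,
    "trajectory_chunk_size": 128,
    "attention_memory_size": 256,
    "beta1": 0.1,
    "beta2": 0.1,
}

# Effective global batch size: per-GPU batch * GPUs * grad accumulation.
effective_batch = (config["batch_size_per_gpu"]
                   * config["gpus"]
                   * config["accumulate_grad_batches"])  # matches the 128 in Table 6
```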