XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning
Authors: Alexander Nikulin, Ilya Zisman, Alexey Zemtsov, Vladislav Kurenkov
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present XLand-100B, a large-scale dataset for in-context reinforcement learning based on the XLand-MiniGrid environment, as a first step to alleviate this problem. It contains complete learning histories for nearly 30,000 different tasks, covering 100B transitions and 2.5B episodes. It took 50,000 GPU hours to collect the dataset, which is beyond the reach of most academic labs. Along with the dataset, we provide the utilities to reproduce or expand it even further. We also benchmark common in-context RL baselines and show that they struggle to generalize to novel and diverse tasks. In this section, we investigate whether our datasets can enable an in-context RL ability. Additionally, we demonstrate how well current in-context algorithms perform across different task complexities and outline their current limitations. We take AD (Laskin et al., 2022) and DPT (Lee et al., 2023) for our experiments... |
| Researcher Affiliation | Collaboration | Alexander Nikulin (AIRI, MIPT); Ilya Zisman (AIRI, Skoltech); Alexey Zemtsov (NUST MISIS, T-Tech); Vladislav Kurenkov (AIRI, Innopolis University) |
| Pseudocode | No | The paper describes methods like Algorithm Distillation (AD) and Decision-Pretrained Transformer (DPT) in prose within Section 2.1 and Section 4.2, and provides details on data collection and evaluation, but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We also release the codebase with tools for creating and expanding the dataset in the following repository: xland-minigrid-datasets. |
| Open Datasets | Yes | Both the XLand-100B and XLand-Trivial-20B datasets are hosted on a public S3 bucket and freely available to everyone under the CC BY-SA 4.0 License. We advise starting with the Trivial dataset for debugging due to its smaller size and faster download time. Datasets can be downloaded with `curl` (or any similar utility): `curl -L -o xland-trivial-20b.hdf5 https://tinyurl.com/trivial-10k` (XLand-Trivial-20B, approx. 60GB) and `curl -L -o xland-100b.hdf5 https://tinyurl.com/medium-30k` (XLand-100B, approx. 325GB). |
| Dataset Splits | Yes | For our main XLand-100B dataset we uniformly sampled tasks from the medium-1m benchmark from XLand-MiniGrid. ... We finetune the agent using 8192 parallel environments for 1B transitions on 30k uniformly sampled tasks from the medium-1m benchmark. ... For evaluation, we run three models on 1024 unseen tasks for 500 episodes. ... We run evaluation for each model for 500 episodes, reporting mean return across 1024 unseen tasks with standard deviation across 3 seeds. |
| Hardware Specification | Yes | The approximate time of training for a single epoch on the XLand-100B dataset and evaluation on 1024 tasks on 8 H100 GPUs is shown in Table 5. ... All experiments ran on 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions software like JAX, FlashAttention-2, ALiBi positional embeddings, and DeepSpeed. However, specific version numbers for these software dependencies are not provided in the text, which is required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | We provide exact hyperparameters for each stage in Appendix O. ... Table 7: DPT Hyperparameters ... Table 8: AD Hyperparameters ... Table 9: PPO hyperparameters used in multi-task pre-training from Section 4.2. ... Table 10: PPO hyperparameters used in single-task fine-tuning from Section 4.2. |
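The split protocol quoted above (30k training tasks drawn uniformly from the medium-1m benchmark, evaluation on 1024 unseen tasks) can be sketched in a few lines of stdlib Python. This is illustrative only: the `split_tasks` function and the assumption that medium-1m exposes a pool of one million integer task IDs are ours, not taken from the paper's released codebase.

```python
import random

def split_tasks(num_pool, num_train, num_eval, seed=0):
    """Draw disjoint train/eval task ID sets uniformly from a benchmark pool.

    Illustrative sketch: mirrors the reported setup of 30k training tasks
    and 1024 unseen evaluation tasks; the actual sampling code lives in the
    authors' xland-minigrid-datasets repository.
    """
    rng = random.Random(seed)
    # Sampling both sets in one call guarantees they are disjoint.
    ids = rng.sample(range(num_pool), num_train + num_eval)
    return set(ids[:num_train]), set(ids[num_train:])

train_tasks, eval_tasks = split_tasks(
    num_pool=1_000_000, num_train=30_000, num_eval=1024
)
print(len(train_tasks), len(eval_tasks))  # 30000 1024
```

Drawing both sets from a single `rng.sample` call is the simplest way to ensure the evaluation tasks are truly unseen during training.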