AssistanceZero: Scalably Solving Assistance Games
Authors: Cassidy Laidlaw, Eli Bronstein, Timothy Guo, Dylan Feng, Lukas Berglund, Justin Svegliato, Stuart Russell, Anca Dragan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that AssistanceZero outperforms model-free RL algorithms and imitation learning in the Minecraft-based assistance game. In a human study, our AssistanceZero-trained assistant significantly reduces the number of actions participants take to complete building tasks in Minecraft. |
| Researcher Affiliation | Academia | University of California, Berkeley, CA, USA. Correspondence to: Cassidy Laidlaw <cassidy EMAIL>. |
| Pseudocode | No | The paper provides detailed descriptions of the MCTS and training procedures in Appendix A, including formulas and step-by-step explanations, but does not present these in a distinct block explicitly labeled as "Pseudocode", "Algorithm", or a code-like formatted procedure. |
| Open Source Code | Yes | Our code and models are available at https://github.com/cassidylaidlaw/minecraft-building-assistance-game. |
| Open Datasets | Yes | At the start of an episode, the goal is sampled from a dataset of houses based on the CraftAssist dataset (Gray et al., 2019). |
| Dataset Splits | No | We maintain separate train and test datasets to evaluate generalization. At the beginning of each training episode, a goal structure θ is randomly sampled from the training dataset Dtrain. We collect 18 episodes in MBAG of five human subjects building houses randomly selected from Dtrain. We randomly sample a unique goal structure for each participant from our test set Dtest. All training uses houses from the train set Dtrain; thus, we always test human models and assistants on unseen goal structures. However, specific percentages or sample counts for these train/test splits are not provided. |
| Hardware Specification | Yes | When evaluating Assistance Zero assistants, we use only 20 simulations of MCTS, which is roughly the number that can run in real-time with Minecraft on an NVIDIA GeForce 1080 Ti GPU. |
| Software Dependencies | No | We implement all RL and imitation learning algorithms in RLlib (Liang et al., 2018) and PyTorch (Paszke et al., 2019). The paper mentions the software used (RLlib and PyTorch) but does not specify their version numbers. |
| Experiment Setup | Yes | Hyperparameters for BC human models (Table 8): Epochs 30, Dropout 0.7, SGD batch size 128, Learning rate 10⁻³. Hyperparameters for PPO human model training (Table 11): Training iterations 100, Rollout length 500, Number of environments 640, SGD batch size 512, Learning rate 3×10⁻⁴. Hyperparameters for PPO assistant training (Table 12): Training iterations 300, Rollout length 64, Number of environments 256, SGD minibatch size 256, Learning rate 3×10⁻⁴. AssistanceZero hyperparameters for MBAG (Table 14): Training iterations 500, Rollout length per iteration per environment 64, Number of environments 256, Replay buffer size 262,144, SGD batch size 256, Learning rate 10⁻³. |
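For quick reference, the hyperparameters quoted in the row above can be collected into plain Python dicts. This is an illustrative sketch only: the dict and key names are hypothetical conveniences, not identifiers from the paper's released code, though the values match the tables cited (Tables 8, 11, 12, and 14).

```python
# Hypothetical grouping of the hyperparameters reported in the paper.
# Key names are our own; values are taken from Tables 8, 11, 12, and 14.

bc_human_model = {          # Table 8: behavior-cloning human models
    "epochs": 30,
    "dropout": 0.7,
    "sgd_batch_size": 128,
    "learning_rate": 1e-3,
}

ppo_human_model = {         # Table 11: PPO human model training
    "training_iterations": 100,
    "rollout_length": 500,
    "num_environments": 640,
    "sgd_batch_size": 512,
    "learning_rate": 3e-4,
}

ppo_assistant = {           # Table 12: PPO assistant training
    "training_iterations": 300,
    "rollout_length": 64,
    "num_environments": 256,
    "sgd_minibatch_size": 256,
    "learning_rate": 3e-4,
}

assistancezero_mbag = {     # Table 14: AssistanceZero in MBAG
    "training_iterations": 500,
    "rollout_length_per_iteration_per_env": 64,
    "num_environments": 256,
    "replay_buffer_size": 262_144,
    "sgd_batch_size": 256,
    "learning_rate": 1e-3,
}
```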