MaestroMotif: Skill Design from Artificial Intelligence Feedback

Authors: Martin Klissarov, Mikael Henaff, Roberta Raileanu, Shagun Sodhani, Pascal Vincent, Amy Zhang, Pierre-Luc Bacon, Doina Precup, Marlos C. Machado, Pierluca D'Oro

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability. We perform a detailed evaluation of the abilities of MaestroMotif on the NLE and compare its performance to a variety of baselines.
Researcher Affiliation | Collaboration | 1 Mila, 2 Meta, 3 University of Texas at Austin, 4 Université de Montréal, 5 McGill University, 6 University of Alberta, 7 Amii, 8 Canada CIFAR AI Chair
Pseudocode | No | The paper describes the method in prose and diagrams (e.g., Figure 2) and provides LLM-generated Python code examples in the appendix (Outputs 1, 3, and 4), but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper references the existing open-source baselines and environments used in the implementation but does not provide a link to, or an explicit statement about, open-sourcing MaestroMotif's own code.
Open Datasets | Yes | Additionally, we use the Dungeons and Data dataset of unannotated human gameplays (Hambro et al., 2022b). Hambro et al. Dungeons and Data: A large-scale NetHack dataset. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022b. URL https://openreview.net/forum?id=zHNNSzo10xN.
Dataset Splits | No | The paper describes using generated preferences and the Dungeons and Data dataset for training but does not provide specific training/validation/test splits with percentages or sample counts for reproducing the experiments.
Hardware Specification | No | The paper mentions 'training runs of several GPU-days' and 'Number of Workers 24' but does not specify exact GPU/CPU models or other hardware details used for the experiments.
Software Dependencies | Yes | We use Llama 3.1 70B (Dubey et al., 2024) via vLLM (Kwon et al., 2023) as the LLM annotator. MaestroMotif uses Llama 3.1 405B to generate code.
Experiment Setup | Yes | Hyperparameters (Table 2): Reward Scale 0.1, Observation Scale 255, Num. of Workers 24, Batch Size 4096, Num. of Environments per Worker 20, PPO Clip Ratio 0.1, PPO Clip Value 1.0, PPO Epochs 1, Max Grad Norm 4.0, Value Loss Coeff. 0.5, Exploration Loss entropy. To obtain the LLM-based reward, we train for 20 epochs using a learning rate of 1×10⁻⁵. The value of the count exponent was 3, whereas for the threshold we used the 85th quantile of the empirical reward distribution for each skill, except for the Discoverer, which used the 95th quantile.
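As a minimal sketch, the hyperparameters quoted above from Table 2 can be collected into plain Python dictionaries. The key names and dict layout are illustrative assumptions on our part, not the authors' actual configuration format; the numeric values are the ones reported in the paper.

```python
# Hypothetical config dicts collecting the reported hyperparameters.
# Key names are illustrative; values are quoted from the paper (Table 2).
ppo_config = {
    "reward_scale": 0.1,
    "observation_scale": 255,
    "num_workers": 24,
    "batch_size": 4096,
    "envs_per_worker": 20,
    "ppo_clip_ratio": 0.1,
    "ppo_clip_value": 1.0,
    "ppo_epochs": 1,
    "max_grad_norm": 4.0,
    "value_loss_coeff": 0.5,
    "exploration_loss": "entropy",
}

# Reward-training settings quoted from the text.
reward_config = {
    "epochs": 20,
    "learning_rate": 1e-5,
    "count_exponent": 3,
    "threshold_quantile_default": 0.85,     # per-skill reward threshold
    "threshold_quantile_discoverer": 0.95,  # Discoverer skill only
}

# With 24 workers and 20 environments each, this setup runs
# 24 * 20 = 480 parallel environments in total.
total_envs = ppo_config["num_workers"] * ppo_config["envs_per_worker"]
print(total_envs)  # 480
```

This kind of flat config dict makes it easy to log or diff the exact settings of a run; it is offered only as one way to organize the reported values.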