MaestroMotif: Skill Design from Artificial Intelligence Feedback

Authors: Martin Klissarov, Mikael Henaff, Roberta Raileanu, Shagun Sodhani, Pascal Vincent, Amy Zhang, Pierre-Luc Bacon, Doina Precup, Marlos C. Machado, Pierluca D'Oro

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability. We perform a detailed evaluation of the abilities of MaestroMotif on the NLE and compare its performance to a variety of baselines.
Researcher Affiliation | Collaboration | 1 Mila, 2 Meta, 3 University of Texas at Austin, 4 Université de Montréal, 5 McGill University, 6 University of Alberta, 7 Amii, 8 Canada CIFAR AI Chair
Pseudocode | No | The paper describes the method in prose and diagrams (e.g., Figure 2) and provides LLM-generated Python code examples in the appendix (Outputs 1, 3, and 4), but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper references the existing open-source baselines and environments used in the implementation but does not provide a link to, or an explicit statement about, open-sourcing MaestroMotif's own code.
Open Datasets | Yes | Additionally, we use the Dungeons and Data dataset of unannotated human gameplays (Hambro et al., 2022b). Hambro et al. Dungeons and Data: A large-scale NetHack dataset. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022b. URL https://openreview.net/forum?id=zHNNSzo10xN.
Dataset Splits | No | The paper describes using generated preferences and the Dungeons and Data dataset for training but does not provide specific training/validation/test splits with percentages or sample counts for reproducing the experiments.
Hardware Specification | No | The paper mentions 'training runs of several GPU-days' and 'Number of Workers 24' but does not specify exact GPU/CPU models or other hardware details used for the experiments.
Software Dependencies | Yes | We use Llama 3.1 70B (Dubey et al., 2024) via vLLM (Kwon et al., 2023) as the LLM annotator. MaestroMotif uses Llama 3.1 405B to generate code.
Experiment Setup | Yes | Hyperparameters (Table 2): Reward Scale 0.1, Observation Scale 255, Num. of Workers 24, Batch Size 4096, Num. of Environments per Worker 20, PPO Clip Ratio 0.1, PPO Clip Value 1.0, PPO Epochs 1, Max Grad Norm 4.0, Value Loss Coeff. 0.5, Exploration Loss entropy. To obtain the LLM-based reward, we train for 20 epochs using a learning rate of 1×10⁻⁵. The value of the count exponent was 3, whereas for the threshold we used the 85th quantile of the empirical reward distribution for each skill, except for the Discoverer, which used the 95th quantile.
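As a minimal sketch, the hyperparameters quoted above from Table 2 can be collected into plain Python dictionaries. The key names and dict layout are illustrative assumptions on our part, not the authors' actual configuration format; the numeric values are the ones reported in the paper.

```python
# Hypothetical config dicts collecting the reported hyperparameters.
# Key names are illustrative; values are quoted from the paper (Table 2).
ppo_config = {
    "reward_scale": 0.1,
    "observation_scale": 255,
    "num_workers": 24,
    "batch_size": 4096,
    "envs_per_worker": 20,
    "ppo_clip_ratio": 0.1,
    "ppo_clip_value": 1.0,
    "ppo_epochs": 1,
    "max_grad_norm": 4.0,
    "value_loss_coeff": 0.5,
    "exploration_loss": "entropy",
}

# Reward-training settings quoted from the text.
reward_config = {
    "epochs": 20,
    "learning_rate": 1e-5,
    "count_exponent": 3,
    "threshold_quantile_default": 0.85,     # per-skill reward threshold
    "threshold_quantile_discoverer": 0.95,  # Discoverer skill only
}

# With 24 workers and 20 environments each, this setup runs
# 24 * 20 = 480 parallel environments in total.
total_envs = ppo_config["num_workers"] * ppo_config["envs_per_worker"]
print(total_envs)  # 480
```

This kind of flat config dict makes it easy to log or diff the exact settings of a run; it is offered only as one way to organize the reported values.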