MERMAIDE: Learning to Align Learners using Model-Based Meta-Learning
Authors: Arundhati Banerjee, Soham Rajesh Phade, Stefano Ermon, Stephan Zheng
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate this approach step-by-step. First, in a Stackelberg setting with a best-response agent, we show that meta-learning enables quick convergence to the theoretically known Stackelberg equilibrium at test time, although noisy observations severely increase the sample complexity. We then show that our model-based meta-learning approach is cost-effective in intervening on bandit agents with unseen explore-exploit strategies. Finally, we outperform baselines that use either meta-learning or agent behavior modeling, in both 0-shot and 1-shot settings with partial agent information. |
| Researcher Affiliation | Collaboration | Arundhati Banerjee (School of Computer Science, Carnegie Mellon University); Soham Phade (Salesforce); Stefano Ermon (Department of Computer Science, Stanford University); Stephan Zheng (Asari AI) |
| Pseudocode | Yes | Algorithm 1 MERMAIDE (Notations also in Table 4) ... Algorithm 2 MERMAIDE (K-shot Adaptation) |
| Open Source Code | No | We plan to release the code for our implementation with the published paper. |
| Open Datasets | No | The paper does not explicitly state the use of a named, publicly available dataset with concrete access information (link, DOI, formal citation). It describes generating bandit agents with different base rewards and exploration parameters, implying a synthetic experimental setup. |
| Dataset Splits | Yes | Here, we use 15 bandit agents for training and 10 bandit agents for testing, each with different base rewards (both within and across train and test sets). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only mentions 'computational costs involved' in Appendix C. |
| Software Dependencies | No | The paper mentions several algorithms and optimizers like REINFORCE (Williams, 1992), MAML (Finn et al., 2017b), and Adam (Kingma & Ba, 2014), but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | In Section 5, the principal policy πp is a fully connected neural network (MLP) with one hidden layer and ReLU activation. ... In Section 6, the recurrent world model and policy networks are GRUs with 2 layers and hidden state dimension 128. For meta-training, the inner gradient update loop uses the SGD optimizer with a learning rate of 7 × 10^−4, whereas the meta-update step uses Adam with a learning rate of 0.001. ... We set c = 0.75. ... We measure the performance of the principal using Equation (2), with γ = 1. ... We consider two agent learning algorithms (UCB and ϵ-greedy) and a range of exploration vs. exploitation characteristics, determined by their exploration coefficients: β ∈ {0.17, 0.27, 0.42, 0.5, 0.67} for UCB ... and ϵ ∈ {0.1, 0.2, 0.3, 0.4, 0.5} for ϵ-greedy. |
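The meta-training recipe quoted above (an inner SGD adaptation loop nested inside a meta-update, as in Algorithm 1/2) can be sketched on a toy problem. This is a hypothetical 1-D stand-in, not the paper's implementation: each "task" is a scalar target τ with squared loss, the inner learning rate (7e-4) matches the quoted setup, and the meta-step uses plain gradient descent where the paper uses Adam over GRU networks.

```python
import random

# Hypothetical 1-D sketch of a MAML-style two-loop meta-training procedure,
# as described in the experiment setup above. Each "task" is a scalar target
# tau with loss (theta - tau)^2. The paper's actual setup uses GRU policy and
# world-model networks; only the inner/outer loop structure is shown here.

INNER_LR = 7e-4   # inner-loop SGD learning rate quoted in the paper
META_LR = 1e-3    # meta-step learning rate (paper uses Adam; plain GD here)

def loss_grad(theta, tau):
    # d/dtheta of the squared loss (theta - tau)^2
    return 2.0 * (theta - tau)

def meta_train(tasks, theta=0.0, meta_iters=5000):
    for _ in range(meta_iters):
        tau = random.choice(tasks)
        # Inner adaptation: one SGD step on the sampled task (K = 1 shot).
        adapted = theta - INNER_LR * loss_grad(theta, tau)
        # Exact MAML meta-gradient of the post-adaptation loss w.r.t. theta
        # (the factor (1 - 2*INNER_LR) is the derivative of the inner step).
        meta_grad = loss_grad(adapted, tau) * (1.0 - 2.0 * INNER_LR)
        theta -= META_LR * meta_grad
    return theta
```

On tasks {1.0, 2.0, 3.0} the meta-parameter drifts toward an initialization from which one inner step does well on any task (here, near the mean target).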
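The two agent learners named in the setup (UCB and ϵ-greedy) follow standard bandit templates. The sketch below is a minimal, assumed implementation for illustration: the exploration coefficients ϵ and β match the quoted ranges, but the reward bookkeeping and tie-breaking are generic choices, not taken from the paper.

```python
import math
import random

# Minimal, assumed implementations of the two bandit learners named in the
# experiment setup: epsilon-greedy and UCB with exploration coefficient beta.
# Details beyond the coefficient names are illustrative, not the paper's code.

class EpsGreedyAgent:
    def __init__(self, n_arms, eps=0.1):
        self.eps = eps
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def act(self):
        # Explore uniformly with probability eps, else exploit the best mean.
        if random.random() < self.eps:
            return random.randrange(len(self.counts))
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean of observed rewards for this arm.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

class UCBAgent:
    def __init__(self, n_arms, beta=0.5):
        self.beta = beta
        self.t = 0
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def act(self):
        self.t += 1
        for a in range(len(self.counts)):
            if self.counts[a] == 0:
                return a  # play each arm once before using the bonus
        def ucb(a):
            bonus = self.beta * math.sqrt(2 * math.log(self.t) / self.counts[a])
            return self.values[a] + bonus
        return max(range(len(self.counts)), key=ucb)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

On a two-arm Bernoulli bandit with well-separated means, both agents concentrate their pulls on the better arm after a few hundred rounds.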