MERMAIDE: Learning to Align Learners using Model-Based Meta-Learning

Authors: Arundhati Banerjee, Soham Rajesh Phade, Stefano Ermon, Stephan Zheng

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate this approach step-by-step. First, in a Stackelberg setting with a best-response agent, we show that meta-learning enables quick convergence to the theoretically known Stackelberg equilibrium at test time, although noisy observations severely increase the sample complexity. We then show that our model-based meta-learning approach is cost-effective in intervening on bandit agents with unseen explore-exploit strategies. Finally, we outperform baselines that use either meta-learning or agent behavior modeling, in both 0-shot and 1-shot settings with partial agent information.
Researcher Affiliation | Collaboration | Arundhati Banerjee (EMAIL), School of Computer Science, Carnegie Mellon University; Soham Phade (EMAIL), Salesforce; Stefano Ermon (EMAIL), Department of Computer Science, Stanford University; Stephan Zheng (EMAIL), Asari AI
Pseudocode | Yes | Algorithm 1 MERMAIDE (Notations also in Table 4) ... Algorithm 2 MERMAIDE (K-shot Adaptation)
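The paper's Algorithm 2 describes K-shot adaptation from a meta-learned initialization. The sketch below is a minimal, illustrative stand-in for that inner loop: the quadratic loss, parameter shapes, and learning rate are assumptions for demonstration, not the paper's actual objective.

```python
# Hedged sketch of a K-shot adaptation loop in the spirit of
# Algorithm 2 (K-shot Adaptation). The loss function, parameter
# vectors, and learning rate here are illustrative assumptions.

def loss(theta, target):
    """Stand-in objective: squared distance to a target parameter."""
    return sum((t - g) ** 2 for t, g in zip(theta, target))

def grad(theta, target):
    """Gradient of the stand-in squared-distance objective."""
    return [2.0 * (t - g) for t, g in zip(theta, target)]

def k_shot_adapt(theta_meta, target, k=1, inner_lr=0.1):
    """Run K inner-loop SGD steps from the meta-learned initialization."""
    theta = list(theta_meta)
    for _ in range(k):
        g = grad(theta, target)
        theta = [t - inner_lr * gi for t, gi in zip(theta, g)]
    return theta

theta_meta = [0.0, 0.0]
target = [1.0, 1.0]
adapted = k_shot_adapt(theta_meta, target, k=1)
# Even one adaptation step should reduce the loss from the meta-init.
assert loss(adapted, target) < loss(theta_meta, target)
```

In the paper's setting, the inner update would adapt the principal's policy to a newly observed agent rather than minimize a fixed quadratic.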
Open Source Code | No | We plan to release the code for our implementation with the published paper.
Open Datasets | No | The paper does not explicitly state the use of a named, publicly available dataset with concrete access information (link, DOI, formal citation). It describes generating bandit agents with different base rewards and exploration parameters, implying a synthetic experimental setup.
Dataset Splits | Yes | Here, we use 15 bandit agents for training and 10 bandit agents for testing, each with different base rewards (both within and across train and test sets).
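Since the agents are synthetic, the split amounts to sampling two disjoint pools of agent configurations. A minimal sketch, assuming uniformly sampled base rewards and a 10-arm bandit (the paper does not specify the sampling scheme or arm count):

```python
import random

# Illustrative generation of synthetic bandit agents with distinct
# base rewards, split 15 train / 10 test as quoted above. The uniform
# sampling and 10-arm assumption are ours, not the paper's.

def make_agents(n, seed, n_arms=10):
    """Sample n agents, each defined by a vector of per-arm base rewards."""
    rng = random.Random(seed)
    return [[rng.uniform(0.0, 1.0) for _ in range(n_arms)] for _ in range(n)]

train_agents = make_agents(15, seed=0)
test_agents = make_agents(10, seed=1)  # different seed -> different rewards
assert len(train_agents) == 15 and len(test_agents) == 10
```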
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only mentions 'computational costs involved' in Appendix C.
Software Dependencies | No | The paper mentions several algorithms and optimizers like REINFORCE (Williams, 1992), MAML (Finn et al., 2017b), and Adam (Kingma & Ba, 2014), but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | In Section 5, the principal policy πp is a fully connected neural network (MLP) with one hidden layer and ReLU activation. ... In Section 6, the recurrent world model and policy networks are GRUs with 2 layers and hidden state dimension 128. For meta-training, the inner gradient update loop uses the SGD optimizer with a learning rate of 7 × 10^−4, whereas the meta-update step uses Adam with a learning rate of 0.001. ... We set c = 0.75. ... We measure the performance of the principal using Equation (2), with γ = 1. ... We consider two agent learning algorithms (UCB and ϵ-greedy) and a range of exploration vs. exploitation characteristics, determined by their exploration coefficients: β ∈ {0.17, 0.27, 0.42, 0.5, 0.67} for UCB ... and ϵ ∈ {0.1, 0.2, 0.3, 0.4, 0.5} for ϵ-greedy.
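The two agent learning rules named above (UCB with exploration coefficient β, and ϵ-greedy) can be sketched as action-selection functions. The interface below (per-arm counts and empirical means) is a standard formulation and an assumption on our part; the β and ϵ values come from the quoted ranges.

```python
import math
import random

# Minimal sketches of the two bandit agent strategies quoted above.
# The per-arm counts/means interface is a standard formulation, not
# necessarily the paper's exact parameterization.

def ucb_action(counts, means, t, beta=0.17):
    """UCB: pick the arm maximizing mean + beta * sqrt(log t / count)."""
    if 0 in counts:
        return counts.index(0)  # play each unplayed arm once first
    scores = [m + beta * math.sqrt(math.log(t) / c)
              for c, m in zip(counts, means)]
    return scores.index(max(scores))

def eps_greedy_action(means, rng, eps=0.1):
    """epsilon-greedy: explore uniformly with prob eps, else exploit."""
    if rng.random() < eps:
        return rng.randrange(len(means))
    return means.index(max(means))

# With equal counts, UCB's bonus is uniform, so the best mean wins.
assert ucb_action([3, 3, 3], [0.2, 0.9, 0.5], t=10) == 1
# With eps=0, epsilon-greedy is purely greedy.
assert eps_greedy_action([0.1, 0.8], random.Random(0), eps=0.0) == 1
```

A larger β or ϵ shifts the agent toward exploration, which is exactly the axis the paper varies across its train/test agent pools.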