MERMAIDE: Learning to Align Learners using Model-Based Meta-Learning
Authors: Arundhati Banerjee, Soham Rajesh Phade, Stefano Ermon, Stephan Zheng
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate this approach step-by-step. First, in a Stackelberg setting with a best-response agent, we show that meta-learning enables quick convergence to the theoretically known Stackelberg equilibrium at test time, although noisy observations severely increase the sample complexity. We then show that our model-based meta-learning approach is cost-effective in intervening on bandit agents with unseen explore-exploit strategies. Finally, we outperform baselines that use either meta-learning or agent behavior modeling, in both 0-shot and 1-shot settings with partial agent information. |
| Researcher Affiliation | Collaboration | Arundhati Banerjee (School of Computer Science, Carnegie Mellon University); Soham Phade (Salesforce); Stefano Ermon (Department of Computer Science, Stanford University); Stephan Zheng (Asari AI) |
| Pseudocode | Yes | Algorithm 1 MERMAIDE (Notations also in Table 4) ... Algorithm 2 MERMAIDE (K-shot Adaptation) |
| Open Source Code | No | We plan to release the code for our implementation with the published paper. |
| Open Datasets | No | The paper does not explicitly state the use of a named, publicly available dataset with concrete access information (link, DOI, formal citation). It describes generating bandit agents with different base rewards and exploration parameters, implying a synthetic experimental setup. |
| Dataset Splits | Yes | Here, we use 15 bandit agents for training and 10 bandit agents for testing, each with different base rewards (both within and across train and test sets). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only mentions 'computational costs involved' in Appendix C. |
| Software Dependencies | No | The paper mentions several algorithms and optimizers like REINFORCE (Williams, 1992), MAML (Finn et al., 2017b), and Adam (Kingma & Ba, 2014), but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | In Section 5, the principal policy πp is a fully connected neural network (MLP) with one hidden layer and ReLU activation. ... In Section 6, the recurrent world model and policy networks are GRUs with 2 layers and hidden state dimension 128. For meta-training, the inner gradient update loop uses the SGD optimizer with a learning rate of 7 × 10^−4, whereas the meta-update step uses Adam with a learning rate of 0.001. ... We set c = 0.75. ... We measure the performance of the principal using Equation (2), with γ = 1. ... We consider two agent learning algorithms (UCB and ϵ-greedy) and a range of exploration vs. exploitation characteristics, determined by their exploration coefficients: β ∈ {0.17, 0.27, 0.42, 0.5, 0.67} for UCB ... and ϵ ∈ {0.1, 0.2, 0.3, 0.4, 0.5} for ϵ-greedy. |
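The meta-training recipe quoted above (an inner SGD adaptation loop nested inside a meta-update, as in Algorithm 1/2) can be sketched on a toy problem. This is a hypothetical 1-D stand-in, not the paper's implementation: each "task" is a scalar target τ with squared loss, the inner learning rate (7e-4) matches the quoted setup, and the meta-step uses plain gradient descent where the paper uses Adam over GRU networks.

```python
import random

# Hypothetical 1-D sketch of a MAML-style two-loop meta-training procedure,
# as described in the experiment setup above. Each "task" is a scalar target
# tau with loss (theta - tau)^2. The paper's actual setup uses GRU policy and
# world-model networks; only the inner/outer loop structure is shown here.

INNER_LR = 7e-4   # inner-loop SGD learning rate quoted in the paper
META_LR = 1e-3    # meta-step learning rate (paper uses Adam; plain GD here)

def loss_grad(theta, tau):
    # d/dtheta of the squared loss (theta - tau)^2
    return 2.0 * (theta - tau)

def meta_train(tasks, theta=0.0, meta_iters=5000):
    for _ in range(meta_iters):
        tau = random.choice(tasks)
        # Inner adaptation: one SGD step on the sampled task (K = 1 shot).
        adapted = theta - INNER_LR * loss_grad(theta, tau)
        # Exact MAML meta-gradient of the post-adaptation loss w.r.t. theta
        # (the factor (1 - 2*INNER_LR) is the derivative of the inner step).
        meta_grad = loss_grad(adapted, tau) * (1.0 - 2.0 * INNER_LR)
        theta -= META_LR * meta_grad
    return theta
```

On tasks {1.0, 2.0, 3.0} the meta-parameter drifts toward an initialization from which one inner step does well on any task (here, near the mean target).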
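The two agent learners named in the setup (UCB and ϵ-greedy) follow standard bandit templates. The sketch below is a minimal, assumed implementation for illustration: the exploration coefficients ϵ and β match the quoted ranges, but the reward bookkeeping and tie-breaking are generic choices, not taken from the paper.

```python
import math
import random

# Minimal, assumed implementations of the two bandit learners named in the
# experiment setup: epsilon-greedy and UCB with exploration coefficient beta.
# Details beyond the coefficient names are illustrative, not the paper's code.

class EpsGreedyAgent:
    def __init__(self, n_arms, eps=0.1):
        self.eps = eps
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def act(self):
        # Explore uniformly with probability eps, else exploit the best mean.
        if random.random() < self.eps:
            return random.randrange(len(self.counts))
        return max(range(len(self.counts)), key=lambda a: self.values[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean of observed rewards for this arm.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

class UCBAgent:
    def __init__(self, n_arms, beta=0.5):
        self.beta = beta
        self.t = 0
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def act(self):
        self.t += 1
        for a in range(len(self.counts)):
            if self.counts[a] == 0:
                return a  # play each arm once before using the bonus
        def ucb(a):
            bonus = self.beta * math.sqrt(2 * math.log(self.t) / self.counts[a])
            return self.values[a] + bonus
        return max(range(len(self.counts)), key=ucb)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

On a two-arm Bernoulli bandit with well-separated means, both agents concentrate their pulls on the better arm after a few hundred rounds.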