POMRL: No-Regret Learning-to-Plan with Increasing Horizons
Authors: Khimya Khetarpal, Claire Vernade, Brendan O'Donoghue, Satinder Singh, Tom Zahavy
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the problem of planning under model uncertainty in an online meta-reinforcement learning (RL) setting where an agent is presented with a sequence of related tasks with limited interactions per task. The agent can use its experience in each task and across tasks to estimate both the transition model and the distribution over tasks. We propose an algorithm to meta-learn the underlying relatedness across tasks, utilize it to plan in each task, and upper-bound the regret of the planning loss. Our bound suggests that the average regret over tasks decreases as the number of tasks increases and as the tasks are more similar. In the classical single-task setting, it is known that the planning horizon should depend on the estimated model's accuracy, that is, on the number of samples within the task. We generalize this finding to meta-RL and study this dependence of planning horizons on the number of tasks. Based on our theoretical findings, we derive heuristics for selecting slowly increasing discount factors, and we validate their significance empirically. |
| Researcher Affiliation | Collaboration | Khimya Khetarpal (EMAIL), Department of Computer Science, McGill University; Claire Vernade (EMAIL), University of Tuebingen; Brendan O'Donoghue (EMAIL), Google DeepMind; Satinder Singh Baveja (EMAIL), Google DeepMind; Tom Zahavy (EMAIL), Google DeepMind |
| Pseudocode | Yes | Algorithm 1: POMRL(σ) — Planning with Online Meta-Reinforcement Learning. Input: a given task-similarity σ(s, a), a matrix of size S×A; initialize P̂_{o,1} to uniform and α_1 = 0. For each task t ∈ [T], for the t-th batch of m samples: P̂_t(m) = (1 − α_t)·(1/m)·Σ_{i=1}^{m} X_i + α_t·P̂_{o,t} (regularized least-squares minimizer); γ ← γ-Selection-Procedure(m, α_t, σ, T, S, A); π_{P̂_t,γ} ← Planning(P̂_t(m)); output π_{P̂_t,γ}; then meta-update the model prior P̂_{o,t+1} (Eq. 5) and the mixing rate α_{t+1} = 1 / (σ²(1 + 1/t)·m + 1). Algorithm 2: ada-POMRL — Planning with Online Meta-Reinforcement Learning. Input: initialize P̂_{o,1} to uniform, σ̂_1 as a matrix of size S×A, and α_1 = 0. For each task t ∈ [T], for the t-th batch of m samples: P̂_t(m) = (1 − α_t)·(1/m)·Σ_{i=1}^{m} X_i + α_t·P̂_{o,t}; γ ← γ-Selection-Procedure(m, α_t, σ̂_t, T, S, A); π_{P̂_t,γ} ← Planning(P̂_t(m)); output π_{P̂_t,γ}; then meta-update the model prior P̂_{o,t+1} (Eq. 5) and the task-similarity estimate σ̂_{t+1} via Welford's online algorithm; finally update the mixing rate α_{t+1} = 1 / (σ̂²_{t+1}(1 + 1/t)·m + 1), plugging in the max over the S×A matrix σ̂. |
| Open Source Code | Yes | Source code is provided in the supplementary material. |
| Open Datasets | No | Setting: For each experiment, we fix a mean model P_o (a next-state distribution for each of the S×A state-action pairs; see below how), and for each new task t ∈ [T], we sample P_t from a Dirichlet distribution centered at P_o. As prescribed by theory (see Sec. 3.2), we set σ ≈ 0.01·√(1/(S·m)) unless otherwise specified (see Q3). Note that σ and σ̂, respectively, are S×A matrices of state-action-dependent variances that capture the directional variance: we used Dirichlet distributions as priors, and these have non-uniform variance levels in the simplex, depending on how close to the simplex boundary the mean is located. Aligned with our theory, we use the max of the σ matrices, resulting in the aforementioned single scalar value. As in Jiang et al. (2015), P_o (and each P_t) characterizes a random chain MDP with S = 10 states and A = 2 actions, drawn such that, for each state-action pair, the transition function P(s, a, s') is constructed by randomly choosing k = 5 states whose probability is set to 0. The values of the remaining S − k states are then drawn uniformly in [0, 1] and the resulting vector is normalized. |
| Dataset Splits | No | The paper describes generating tasks from a Dirichlet distribution and using 'm' samples per task. It does not mention explicit training/test/validation splits of a static dataset, but rather an online meta-learning setting where the agent processes a sequence of tasks. |
| Hardware Specification | No | All experiments were conducted using Google Colab instances (https://colab.research.google.com/). |
| Software Dependencies | No | The paper mentions using Google Colab instances, which implies a software environment, but it does not specify any particular programming languages, libraries, or their version numbers used for implementation. |
| Experiment Setup | Yes | For each experiment, we fix a mean model P_o (see below how), and for each new task t ∈ [T], we sample P_t from a Dirichlet distribution centered at P_o. As prescribed by theory (see Sec. 3.2), we set σ ≈ 0.01·√(1/(S·m)) unless otherwise specified (see Q3). ... For all baselines, the number of samples per task is m = 5. Results are averaged over 100 independent runs. We also propose and empirically validate competitive heuristics for the γ-Selection-Procedure in Sec. 6, and we run an additional baseline, Aggregating (α = 1), which simply ignores the meta-RL structure and plans as if there were a single task (see Appendix F.2). |
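The pseudocode row above flattens the two updates that do the real work in POMRL: the shrinkage estimate of each task's transition model and the mixing-rate schedule. Below is a minimal sketch of one such round, assuming numpy; the function names are illustrative (not the authors' code), planning and γ-selection are treated as external, and the mixing-rate formula follows the reconstructed expression α = 1/(σ²(1 + 1/t)·m + 1) from the table.

```python
import numpy as np

def pomrl_estimate(X, P_prior, alpha):
    """Regularized least-squares estimate for one (s, a) pair (sketch).

    X       : (m, S) array of one-hot next-state samples for a fixed (s, a)
    P_prior : (S,) current meta-learned prior P-hat_{o,t}
    alpha   : mixing rate in [0, 1]
    Returns (1 - alpha) * empirical mean + alpha * prior.
    """
    empirical = X.mean(axis=0)  # (1/m) * sum_i X_i
    return (1.0 - alpha) * empirical + alpha * P_prior

def mixing_rate(sigma, t, m):
    """alpha_{t+1} = 1 / (sigma^2 * (1 + 1/t) * m + 1) -- reconstructed form."""
    return 1.0 / (sigma**2 * (1.0 + 1.0 / t) * m + 1.0)
```

Note the qualitative behavior the bound relies on: when σ is small (similar tasks) or m is small (few within-task samples), α is close to 1 and the estimate leans on the meta-learned prior; when σ or m is large, α shrinks toward 0 and the within-task empirical mean dominates.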
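The task-generation recipe quoted in the table (random chain MDPs as in Jiang et al. (2015), with per-task models drawn from a Dirichlet centered at the mean model) can be sketched as follows. The concentration parameter `kappa`, which controls how similar tasks are, is an assumption for illustration, not a value from the paper.

```python
import numpy as np

def random_mean_model(S=10, A=2, k=5, rng=None):
    """Mean model P_o: for each (s, a), zero out k random next states,
    draw the remaining S - k entries uniformly in [0, 1], and normalize."""
    rng = rng or np.random.default_rng(0)
    P = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            keep = rng.choice(S, size=S - k, replace=False)
            P[s, a, keep] = rng.uniform(0.0, 1.0, size=S - k)
            P[s, a] /= P[s, a].sum()
    return P

def sample_task(P_o, kappa=100.0, rng=None):
    """Per-task model P_t ~ Dirichlet centered (row-wise) at P_o.
    Larger kappa -> lower variance -> more similar tasks."""
    rng = rng or np.random.default_rng()
    S, A, _ = P_o.shape
    P_t = np.zeros_like(P_o)
    for s in range(S):
        for a in range(A):
            # Dirichlet parameters must be strictly positive; add a tiny eps
            P_t[s, a] = rng.dirichlet(kappa * P_o[s, a] + 1e-6)
    return P_t
```

This matches the quoted setting's non-uniform variance remark: rows of P_o that sit near the simplex boundary (many zeroed entries) induce Dirichlet draws with very different variance levels across coordinates, which is why the paper tracks σ as an S×A matrix and plugs in its max.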
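ada-POMRL replaces the known task-similarity σ with a running estimate σ̂ maintained, per the pseudocode, by Welford's online algorithm over the observed task models. The sketch below is a standard Welford recursion applied elementwise; the coupling to the prior P̂_{o,t} (whose role the running mean plays here) is schematic rather than the authors' exact update.

```python
import numpy as np

class WelfordTracker:
    """Running mean and sample variance of observed task models,
    tracked elementwise with the standard Welford recursion."""

    def __init__(self, shape):
        self.n = 0
        self.mean = np.zeros(shape)  # plays the role of the prior P-hat_{o,t}
        self.M2 = np.zeros(shape)    # sum of squared deviations

    def update(self, P_t):
        """Incorporate one observed task model P_t."""
        self.n += 1
        delta = P_t - self.mean
        self.mean += delta / self.n
        self.M2 += delta * (P_t - self.mean)

    def variance(self):
        """Unbiased sample variance; zero until two tasks have been seen."""
        return self.M2 / (self.n - 1) if self.n > 1 else np.zeros_like(self.M2)
```

Per the table, ada-POMRL would then take the max over the entries of this variance matrix and plug it into the mixing-rate update in place of σ².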