POMRL: No-Regret Learning-to-Plan with Increasing Horizons
Authors: Khimya Khetarpal, Claire Vernade, Brendan O'Donoghue, Satinder Singh, Tom Zahavy
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the problem of planning under model uncertainty in an online meta-reinforcement learning (RL) setting where an agent is presented with a sequence of related tasks with limited interactions per task. The agent can use its experience in each task and across tasks to estimate both the transition model and the distribution over tasks. We propose an algorithm to meta-learn the underlying relatedness across tasks, utilize it to plan in each task, and upper-bound the regret of the planning loss. Our bound suggests that the average regret over tasks decreases as the number of tasks increases and as the tasks are more similar. In the classical single-task setting, it is known that the planning horizon should depend on the estimated model's accuracy, that is, on the number of samples within the task. We generalize this finding to meta-RL and study this dependence of planning horizons on the number of tasks. Based on our theoretical findings, we derive heuristics for selecting slowly increasing discount factors, and we validate their significance empirically. |
| Researcher Affiliation | Collaboration | Khimya Khetarpal (EMAIL), Department of Computer Science, McGill University; Claire Vernade (EMAIL), University of Tuebingen; Brendan O'Donoghue (EMAIL), Google DeepMind; Satinder Singh Baveja (EMAIL), Google DeepMind; Tom Zahavy (EMAIL), Google DeepMind |
| Pseudocode | Yes | Algorithm 1: POMRL(σ) — Planning with Online Meta-Reinforcement Learning. Input: a given task-similarity σ(s, a), a matrix of size S×A; initialize P̂_{o,1} to uniform and α_1 = 0. For each task t ∈ [T], for the t-th batch of m samples: P̂_t(m) = (1 − α_t)·(1/m)·Σ_{i=1}^{m} X_i + α_t·P̂_{o,t} (regularized least-squares minimizer); γ ← γ-Selection-Procedure(m, α_t, σ, T, S, A); π_{P̂_t,γ} ← Planning(P̂_t(m)); output π_{P̂_t,γ}; then meta-update the model prior P̂_{o,t+1} (Eq. 5) and the mixing rate α_{t+1} = 1 / (σ²(1 + 1/t)·m + 1). Algorithm 2: ada-POMRL — Planning with Online Meta-Reinforcement Learning. Input: initialize P̂_{o,1} to uniform, σ̂_1 as a matrix of size S×A, and α_1 = 0. For each task t ∈ [T], for the t-th batch of m samples: P̂_t(m) = (1 − α_t)·(1/m)·Σ_{i=1}^{m} X_i + α_t·P̂_{o,t}; γ ← γ-Selection-Procedure(m, α_t, σ̂_t, T, S, A); π_{P̂_t,γ} ← Planning(P̂_t(m)); output π_{P̂_t,γ}; then meta-update the model prior P̂_{o,t+1} (Eq. 5) and the task-similarity estimate σ̂_{t+1} via Welford's online algorithm; finally update the mixing rate α_{t+1} = 1 / (σ̂²_{t+1}(1 + 1/t)·m + 1), plugging in the max over the S×A matrix σ̂. |
| Open Source Code | Yes | Source code is provided in the supplementary material. |
| Open Datasets | No | Setting: For each experiment, we fix a mean model P_o (a next-state distribution for each of the S×A state-action pairs; see below how), and for each new task t ∈ [T], we sample P_t from a Dirichlet distribution centered at P_o. As prescribed by theory (see Sec. 3.2), we set σ ≈ 0.01·√(1/(S·m)) unless otherwise specified (see Q3). Note that σ and σ̂, respectively, are S×A matrices of state-action-dependent variances that capture the directional variance: we used Dirichlet distributions as priors, and these have non-uniform variance levels in the simplex, depending on how close to the simplex boundary the mean is located. Aligned with our theory, we use the max of the σ matrices, resulting in the aforementioned single scalar value. As in Jiang et al. (2015), P_o (and each P_t) characterizes a random chain MDP with S = 10 states and A = 2 actions, drawn such that, for each state-action pair, the transition function P(s, a, s') is constructed by randomly choosing k = 5 states whose probability is set to 0. The values of the remaining S − k states are then drawn uniformly in [0, 1] and the resulting vector is normalized. |
| Dataset Splits | No | The paper describes generating tasks from a Dirichlet distribution and using 'm' samples per task. It does not mention explicit training/test/validation splits of a static dataset, but rather an online meta-learning setting where the agent processes a sequence of tasks. |
| Hardware Specification | No | All experiments were conducted using Google Colab instances (https://colab.research.google.com/). |
| Software Dependencies | No | The paper mentions using Google Colab instances, which implies a software environment, but it does not specify any particular programming languages, libraries, or their version numbers used for implementation. |
| Experiment Setup | Yes | For each experiment, we fix a mean model P_o (see below how), and for each new task t ∈ [T], we sample P_t from a Dirichlet distribution centered at P_o. As prescribed by theory (see Sec. 3.2), we set σ ≈ 0.01·√(1/(S·m)) unless otherwise specified (see Q3). ... For all baselines, the number of samples per task is m = 5. Results are averaged over 100 independent runs. We also propose and empirically validate competitive heuristics for the γ-Selection-Procedure in Sec. 6, and we run an additional baseline, Aggregating (α = 1), which simply ignores the meta-RL structure and plans as if there were a single task (see Appendix F.2). |
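The pseudocode row above flattens the two updates that do the real work in POMRL: the shrinkage estimate of each task's transition model and the mixing-rate schedule. Below is a minimal sketch of one such round, assuming numpy; the function names are illustrative (not the authors' code), planning and γ-selection are treated as external, and the mixing-rate formula follows the reconstructed expression α = 1/(σ²(1 + 1/t)·m + 1) from the table.

```python
import numpy as np

def pomrl_estimate(X, P_prior, alpha):
    """Regularized least-squares estimate for one (s, a) pair (sketch).

    X       : (m, S) array of one-hot next-state samples for a fixed (s, a)
    P_prior : (S,) current meta-learned prior P-hat_{o,t}
    alpha   : mixing rate in [0, 1]
    Returns (1 - alpha) * empirical mean + alpha * prior.
    """
    empirical = X.mean(axis=0)  # (1/m) * sum_i X_i
    return (1.0 - alpha) * empirical + alpha * P_prior

def mixing_rate(sigma, t, m):
    """alpha_{t+1} = 1 / (sigma^2 * (1 + 1/t) * m + 1) -- reconstructed form."""
    return 1.0 / (sigma**2 * (1.0 + 1.0 / t) * m + 1.0)
```

Note the qualitative behavior the bound relies on: when σ is small (similar tasks) or m is small (few within-task samples), α is close to 1 and the estimate leans on the meta-learned prior; when σ or m is large, α shrinks toward 0 and the within-task empirical mean dominates.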
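The task-generation recipe quoted in the table (random chain MDPs as in Jiang et al. (2015), with per-task models drawn from a Dirichlet centered at the mean model) can be sketched as follows. The concentration parameter `kappa`, which controls how similar tasks are, is an assumption for illustration, not a value from the paper.

```python
import numpy as np

def random_mean_model(S=10, A=2, k=5, rng=None):
    """Mean model P_o: for each (s, a), zero out k random next states,
    draw the remaining S - k entries uniformly in [0, 1], and normalize."""
    rng = rng or np.random.default_rng(0)
    P = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            keep = rng.choice(S, size=S - k, replace=False)
            P[s, a, keep] = rng.uniform(0.0, 1.0, size=S - k)
            P[s, a] /= P[s, a].sum()
    return P

def sample_task(P_o, kappa=100.0, rng=None):
    """Per-task model P_t ~ Dirichlet centered (row-wise) at P_o.
    Larger kappa -> lower variance -> more similar tasks."""
    rng = rng or np.random.default_rng()
    S, A, _ = P_o.shape
    P_t = np.zeros_like(P_o)
    for s in range(S):
        for a in range(A):
            # Dirichlet parameters must be strictly positive; add a tiny eps
            P_t[s, a] = rng.dirichlet(kappa * P_o[s, a] + 1e-6)
    return P_t
```

This matches the quoted setting's non-uniform variance remark: rows of P_o that sit near the simplex boundary (many zeroed entries) induce Dirichlet draws with very different variance levels across coordinates, which is why the paper tracks σ as an S×A matrix and plugs in its max.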
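ada-POMRL replaces the known task-similarity σ with a running estimate σ̂ maintained, per the pseudocode, by Welford's online algorithm over the observed task models. The sketch below is a standard Welford recursion applied elementwise; the coupling to the prior P̂_{o,t} (whose role the running mean plays here) is schematic rather than the authors' exact update.

```python
import numpy as np

class WelfordTracker:
    """Running mean and sample variance of observed task models,
    tracked elementwise with the standard Welford recursion."""

    def __init__(self, shape):
        self.n = 0
        self.mean = np.zeros(shape)  # plays the role of the prior P-hat_{o,t}
        self.M2 = np.zeros(shape)    # sum of squared deviations

    def update(self, P_t):
        """Incorporate one observed task model P_t."""
        self.n += 1
        delta = P_t - self.mean
        self.mean += delta / self.n
        self.M2 += delta * (P_t - self.mean)

    def variance(self):
        """Unbiased sample variance; zero until two tasks have been seen."""
        return self.M2 / (self.n - 1) if self.n > 1 else np.zeros_like(self.M2)
```

Per the table, ada-POMRL would then take the max over the entries of this variance matrix and plug it into the mixing-rate update in place of σ².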