KD-BIRL: Kernel Density Bayesian Inverse Reinforcement Learning
Authors: Aishwarya Mandyam, Didong Li, Andrew Jones, Diana Cai, Barbara E Engelhardt
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results highlight KD-BIRL's faster concentration rate in comparison to baselines, particularly in low test-task expert demonstration data regimes. Additionally, we are the first to provide theoretical guarantees of posterior concentration for a Bayesian IRL algorithm. Taken together, this work introduces a principled and theoretically grounded framework that enables Bayesian IRL to be applied across a variety of domains. |
| Researcher Affiliation | Academia | Aishwarya Mandyam (EMAIL), Department of Computer Science, Stanford University; Didong Li (EMAIL), Department of Biostatistics, University of North Carolina; Diana Cai (dcai@flatironinstitute.org), Flatiron Institute; Andrew Jones (EMAIL), Department of Computer Science, Princeton University; Barbara E. Engelhardt (EMAIL), Gladstone Institutes and Department of Biomedical Data Science, Stanford University |
| Pseudocode | Yes | We use a Hamiltonian Monte Carlo algorithm (Team, 2011) (details in Appendix F, and Algorithm 1) which is suited to large parameter spaces. |
| Open Source Code | No | No explicit statement about open-source code for the described methodology or a repository link was found in the paper. |
| Open Datasets | No | The first is a Gridworld setting with a discrete state space. We use three grid sizes (2×2, 5×5, and 10×10) to investigate how KD-BIRL's performance scales. The second setting is a simulated sepsis treatment environment (Amirhossein Kiani, 2019), which has a continuous state space and is thus more challenging. |
| Dataset Splits | No | We assume that we have several training tasks and a single test task. For each training task, we have access to both optimal demonstrations from the corresponding expert RL agent, and know the reward function the expert is optimizing for. Specifically, there are m samples in the training dataset {(s_j, a_j, R_j)}_{j=1}^m ... Our goal is to learn the unknown reward function of a new test task given a limited amount of expert demonstrations of the new test task, {(s^e_i)}_{i=1}^n, (n ≪ m). |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models, memory, or cloud instances with specifications) were provided in the paper. |
| Software Dependencies | Yes | We use a Hamiltonian Monte Carlo algorithm (Team, 2011) (details in Appendix F, and Algorithm 1) which is suited to large parameter spaces. |
| Experiment Setup | Yes | We choose the bandwidth hyperparameters h, h′ using rule-of-thumb procedures (Silverman, 1986). These procedures define the optimal bandwidth hyperparameters as the variance of the pairwise distance between the training data demonstrations and the training data reward functions respectively. ... AVRIL uses variational inference to approximate the posterior distribution on the reward function... AVRIL is initialized using an informative prior learned from the training tasks. ... p(R) = N(µ₀, σ₀²), where µ₀ = (1/m) Σ_{j=1}^m R_j(s_j, a_j) and σ₀² = (1/m) Σ_{j=1}^m (R_j(s_j, a_j) − µ₀)². |
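The two quantities quoted in the Experiment Setup row — the rule-of-thumb bandwidth (variance of pairwise distances over training data) and the informative Normal prior with empirical moments — can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names, the use of Euclidean distance, and the reduction over distinct pairs are assumptions, since the quoted text only states which moments are used.

```python
import numpy as np

def rule_of_thumb_bandwidth(X):
    """Bandwidth chosen as the variance of pairwise Euclidean distances
    between rows of X (the stated rule for h and h'). X: (n, d) array."""
    diffs = X[:, None, :] - X[None, :, :]          # (n, n, d) differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))     # (n, n) distance matrix
    iu = np.triu_indices(len(X), k=1)              # distinct pairs only
    return float(np.var(dists[iu]))

def empirical_normal_prior(rewards):
    """Informative prior p(R) = N(mu0, sigma0^2), with mu0 and sigma0^2
    taken as the mean and variance of the training rewards R_j(s_j, a_j)."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.mean(rewards)), float(np.var(rewards))
```

For example, `empirical_normal_prior([1.0, 2.0, 3.0])` gives the prior moments (2.0, 2/3); the bandwidth helper would be applied once to the demonstration states for h and once to the reward values for h′.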