Selective Uncertainty Propagation in Offline RL

Authors: Sanath Kumar Krishnamurthy, Tanmay Gangwani, Sumeet Katariya, Branislav Kveton, Shrey Modi, Anshuka Rangi

AAAI 2025

Reproducibility variables, results, and supporting LLM responses:

Research Type: Experimental
LLM Response: We show benefits of our approach on toy environments and demonstrate the benefits of these techniques for offline policy learning... Figure 1 plots CIs for both selective and standard uncertainty propagation, when varying the evaluation policy... In Figure 2, we plot the value of the learnt policy from various algorithms as we vary the number of training episodes. In particular, we compare SPVI, PVI (Jin, Yang, and Wang 2021), and pessimistic supervised learning (PSL).

Researcher Affiliation: Collaboration
LLM Response: Sanath Kumar Krishnamurthy (Meta), Tanmay Gangwani (Amazon), Sumeet Katariya (Meta), Branislav Kveton (Adobe), Shrey Modi (Indian Institute of Technology, Bombay), Anshuka Rangi (Amazon)

Pseudocode: Yes
LLM Response: In this section, we propose a modification of PVI called selectively pessimistic value iteration (SPVI; complete pseudo-code is available in Appendix B).

Open Source Code: No
LLM Response: The paper does not provide an explicit statement about open-sourcing the code, nor does it provide a link to a code repository. It mentions that 'complete pseudo-code is available in Appendix B', but this is not the same as open-source code.

Open Datasets: No
LLM Response: The paper mentions a simple toy environment called Chain Bandit for simulations but does not provide any specific access information (link, DOI, repository, or citation) for a publicly available or open dataset.

Dataset Splits: No
LLM Response: The paper refers to a dataset S consisting of T trajectories and a holdout dataset, and discusses training episodes, but it does not specify explicit training/test/validation splits (e.g., percentages, sample counts, or a splitting methodology with seeds or a cross-validation setup) needed to reproduce the data partitioning.

Hardware Specification: Yes
LLM Response: All algorithm runs take less than 2 minutes on a MacBook Pro M2 16GB.

Software Dependencies: No
LLM Response: The paper does not provide specific software dependencies, such as library names with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1), used to implement the methodology or run the experiments.

Experiment Setup: Yes
LLM Response: Throughout this section, we consider Chain Bandit with a chain length of 3 and the following behavioral policy (πb): at every state and step, πb selects action a3 with probability 0.8 and each of the other two actions with probability 0.1... Here the bonus is given by b(x, a) = √(ln(|X| |A| H / δ) / n(x, a)), where n(x, a) is the number of times action a was taken at state x, and the confidence parameter δ = 0.05... The number of training episodes is 10000, and the plots are averaged over 10 runs.
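The report notes that the paper builds SPVI as a modification of PVI (Jin, Yang, and Wang 2021), with the full pseudo-code deferred to the paper's Appendix B. As background, a minimal textbook-style sketch of plain tabular pessimistic value iteration (not the paper's selective variant) might look like the following; all inputs are assumed empirical estimates from a logged dataset, and the clipping range [0, H] is a simplification:

```python
import numpy as np

def pvi(r_hat, p_hat, bonus, horizon):
    """Sketch of tabular pessimistic value iteration (PVI).

    r_hat:  (H, S, A) estimated rewards
    p_hat:  (H, S, A, S) estimated transition probabilities
    bonus:  (H, S, A) uncertainty bonuses b(x, a)
    Returns the greedy policy (H, S) and pessimistic values (H+1, S).
    """
    S = r_hat.shape[1]
    v = np.zeros((horizon + 1, S))
    pi = np.zeros((horizon, S), dtype=int)
    # Backward induction: subtract the bonus from Q so that poorly
    # covered (state, action) pairs look pessimistically bad.
    for h in range(horizon - 1, -1, -1):
        q = r_hat[h] + p_hat[h] @ v[h + 1] - bonus[h]  # pessimistic Q
        q = np.clip(q, 0.0, horizon)                   # keep values in a sane range
        pi[h] = q.argmax(axis=1)
        v[h] = q.max(axis=1)
    return pi, v
```

On a toy two-state MDP where one action carries a large bonus (heavy uncertainty penalty), the returned policy avoids that action even when its estimated reward is higher; the paper's SPVI differs precisely in propagating such bonuses selectively rather than at every step.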
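The experiment-setup quote fully specifies the behavioral policy and the count-based bonus, which can be sketched directly. This is a hedged illustration only: the Chain Bandit dynamics are not given in the quote, so the sketch assumes each chain position is visited once per episode and takes H equal to the chain length; only the data-collection counts and the bonus b(x, a) = √(ln(|X| |A| H / δ) / n(x, a)) are faithful to the description.

```python
import math
import random

NUM_STATES = 3      # |X|: one state per chain position (assumption)
NUM_ACTIONS = 3     # |A|: actions a1, a2, a3
HORIZON = 3         # H, taken equal to the chain length (assumption)
DELTA = 0.05        # confidence parameter from the paper
EPISODES = 10_000   # number of training episodes from the paper

def behavioral_action(rng: random.Random) -> int:
    """Sample an action index: a3 (index 2) w.p. 0.8, a1/a2 w.p. 0.1 each."""
    return rng.choices([0, 1, 2], weights=[0.1, 0.1, 0.8])[0]

def bonus(n_xa: int) -> float:
    """Count-based pessimism bonus b(x, a) from the quoted formula."""
    return math.sqrt(math.log(NUM_STATES * NUM_ACTIONS * HORIZON / DELTA) / n_xa)

rng = random.Random(0)
counts = [[0] * NUM_ACTIONS for _ in range(NUM_STATES)]
for _ in range(EPISODES):
    for x in range(NUM_STATES):  # one visit per chain position per episode (simplification)
        counts[x][behavioral_action(rng)] += 1

# Rarely taken actions get a large bonus (more pessimism), frequent ones a small one.
for x in range(NUM_STATES):
    for a in range(NUM_ACTIONS):
        print(f"n({x},{a}) = {counts[x][a]:5d}   b = {bonus(counts[x][a]):.4f}")
```

Because a3 absorbs 80% of the behavioral policy's probability mass, its counts are roughly eight times those of a1 and a2, so its bonus is markedly smaller; this coverage imbalance is what makes the environment a useful testbed for comparing selective and standard uncertainty propagation.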