Selective Uncertainty Propagation in Offline RL

Authors: Sanath Kumar Krishnamurthy, Tanmay Gangwani, Sumeet Katariya, Branislav Kveton, Shrey Modi, Anshuka Rangi

AAAI 2025

Reproducibility variables, results, and supporting LLM responses:

Research Type: Experimental
LLM Response: We show benefits of our approach on toy environments and demonstrate the benefits of these techniques for offline policy learning... Figure 1 plots CIs for both selective and standard uncertainty propagation, when varying the evaluation policy... In Figure 2, we plot the value of the learnt policy from various algorithms as we vary the number of training episodes. In particular, we compare SPVI, PVI (Jin, Yang, and Wang 2021), and pessimistic supervised learning (PSL).

Researcher Affiliation: Collaboration
LLM Response: Sanath Kumar Krishnamurthy (Meta), Tanmay Gangwani (Amazon), Sumeet Katariya (Meta), Branislav Kveton (Adobe), Shrey Modi (Indian Institute of Technology, Bombay), Anshuka Rangi (Amazon)

Pseudocode: Yes
LLM Response: In this section, we propose a modification of PVI called selectively pessimistic value iteration (SPVI; complete pseudo-code is available in Appendix B).

Open Source Code: No
LLM Response: The paper does not provide an explicit statement about open-sourcing the code, nor does it provide a link to a code repository. It mentions that 'complete pseudo-code is available in Appendix B', but this is not the same as open-source code.

Open Datasets: No
LLM Response: The paper mentions a simple toy environment called Chain Bandit for simulations but does not provide any specific access information (link, DOI, repository, or citation) for a publicly available or open dataset.

Dataset Splits: No
LLM Response: The paper refers to a dataset S consisting of T trajectories and a holdout dataset, and discusses training episodes, but it does not specify explicit training/test/validation splits (e.g., percentages, sample counts, or a splitting methodology with seeds or a cross-validation setup) needed to reproduce the data partitioning.

Hardware Specification: Yes
LLM Response: All algorithm runs take less than 2 minutes on a MacBook Pro M2 16GB.

Software Dependencies: No
LLM Response: The paper does not provide specific software dependencies, such as library names with version numbers (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1), used to implement the methodology or run the experiments.

Experiment Setup: Yes
LLM Response: Throughout this section, we consider Chain Bandit with a chain length of 3 and the following behavioral policy (πb): at every state and step, πb selects action a3 with probability 0.8 and each of the other two actions with probability 0.1... Here the bonus is given by b(x, a) = √(ln(|X| |A| H / δ) / n(x, a)), where n(x, a) is the number of times action a was taken at state x, and the confidence parameter δ = 0.05... The number of training episodes is 10000, and the plots are averaged over 10 runs.
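The report notes that the paper builds SPVI as a modification of PVI (Jin, Yang, and Wang 2021), with the full pseudo-code deferred to the paper's Appendix B. As background, a minimal textbook-style sketch of plain tabular pessimistic value iteration (not the paper's selective variant) might look like the following; all inputs are assumed empirical estimates from a logged dataset, and the clipping range [0, H] is a simplification:

```python
import numpy as np

def pvi(r_hat, p_hat, bonus, horizon):
    """Sketch of tabular pessimistic value iteration (PVI).

    r_hat:  (H, S, A) estimated rewards
    p_hat:  (H, S, A, S) estimated transition probabilities
    bonus:  (H, S, A) uncertainty bonuses b(x, a)
    Returns the greedy policy (H, S) and pessimistic values (H+1, S).
    """
    S = r_hat.shape[1]
    v = np.zeros((horizon + 1, S))
    pi = np.zeros((horizon, S), dtype=int)
    # Backward induction: subtract the bonus from Q so that poorly
    # covered (state, action) pairs look pessimistically bad.
    for h in range(horizon - 1, -1, -1):
        q = r_hat[h] + p_hat[h] @ v[h + 1] - bonus[h]  # pessimistic Q
        q = np.clip(q, 0.0, horizon)                   # keep values in a sane range
        pi[h] = q.argmax(axis=1)
        v[h] = q.max(axis=1)
    return pi, v
```

On a toy two-state MDP where one action carries a large bonus (heavy uncertainty penalty), the returned policy avoids that action even when its estimated reward is higher; the paper's SPVI differs precisely in propagating such bonuses selectively rather than at every step.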
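The experiment-setup quote fully specifies the behavioral policy and the count-based bonus, which can be sketched directly. This is a hedged illustration only: the Chain Bandit dynamics are not given in the quote, so the sketch assumes each chain position is visited once per episode and takes H equal to the chain length; only the data-collection counts and the bonus b(x, a) = √(ln(|X| |A| H / δ) / n(x, a)) are faithful to the description.

```python
import math
import random

NUM_STATES = 3      # |X|: one state per chain position (assumption)
NUM_ACTIONS = 3     # |A|: actions a1, a2, a3
HORIZON = 3         # H, taken equal to the chain length (assumption)
DELTA = 0.05        # confidence parameter from the paper
EPISODES = 10_000   # number of training episodes from the paper

def behavioral_action(rng: random.Random) -> int:
    """Sample an action index: a3 (index 2) w.p. 0.8, a1/a2 w.p. 0.1 each."""
    return rng.choices([0, 1, 2], weights=[0.1, 0.1, 0.8])[0]

def bonus(n_xa: int) -> float:
    """Count-based pessimism bonus b(x, a) from the quoted formula."""
    return math.sqrt(math.log(NUM_STATES * NUM_ACTIONS * HORIZON / DELTA) / n_xa)

rng = random.Random(0)
counts = [[0] * NUM_ACTIONS for _ in range(NUM_STATES)]
for _ in range(EPISODES):
    for x in range(NUM_STATES):  # one visit per chain position per episode (simplification)
        counts[x][behavioral_action(rng)] += 1

# Rarely taken actions get a large bonus (more pessimism), frequent ones a small one.
for x in range(NUM_STATES):
    for a in range(NUM_ACTIONS):
        print(f"n({x},{a}) = {counts[x][a]:5d}   b = {bonus(counts[x][a]):.4f}")
```

Because a3 absorbs 80% of the behavioral policy's probability mass, its counts are roughly eight times those of a1 and a2, so its bonus is markedly smaller; this coverage imbalance is what makes the environment a useful testbed for comparing selective and standard uncertainty propagation.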