reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Offline RL in Regular Decision Processes: Sample Efficiency via Language Metrics

Authors: Ahana Deb, Roberto Cipollone, Anders Jonsson, Alessandro Ronca, Mohammad Sadegh Talebi

ICLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section we present experimental results to illustrate the properties of our two versions of ADACT-H. We perform experiments in five domains from the literature on POMDPs and RDPs: Corridor (Ronca & De Giacomo, 2021), T-maze(c) (Bakker, 2001), Cookie (Toro Icarte et al., 2019), Cheese (McCallum, 1992) and Mini-hall (Littman et al., 1995), and summarize the results in Table 1.
Researcher Affiliation	Collaboration	(1) Universitat Pompeu Fabra; (2) Leonardo S.p.A.; (3) University of Oxford; (4) University of Copenhagen
Pseudocode	Yes	For reference, we include the pseudocode of ADACT-H(D, δ) in Appendix A. Cipollone et al. (2023) prove that ADACT-H(D, δ) constructs a minimal RDP R with probability at least 1 − 4AOUδ if D is large enough. Appendix A: PSEUDOCODE OF ADACT-H, ADACT-H-A AND REGORL. Function ADACT-H(D, δ). Function ADACT-H-A(D, δ, ε, U, C). Algorithm 1: RegORL.
Open Source Code	No	The paper does not provide concrete access to source code for the methodology described. It mentions comparing against an existing algorithm (FlexFringe) but does not provide a link or statement for its own implementation.
Open Datasets	Yes	We perform experiments in five domains from the literature on POMDPs and RDPs: Corridor (Ronca & De Giacomo, 2021), T-maze(c) (Bakker, 2001), Cookie (Toro Icarte et al., 2019), Cheese (McCallum, 1992) and Mini-hall (Littman et al., 1995).
Dataset Splits	No	The paper mentions splitting the dataset D into two datasets D1 and D2 of the same size for the algorithm's internal use (learning RDP states and training the policy), as described in Algorithm 1, step 1. However, it does not specify explicit training, validation, or test splits for evaluating the model's generalization on held-out data in the experimental section.
Hardware Specification	No	The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, or memory amounts) used for running its experiments.
Software Dependencies	No	The paper mentions specific algorithms and data structures like "Count-Min-Sketch (CMS)" and compares its approach to "FlexFringe," but it does not specify version numbers for any software libraries, programming languages, or specific implementations used in its experimental setup.
Experiment Setup	Yes	With the increasing corridor length N (and horizon H N), we plot the time taken and RDP size over 20 runs, with K = 100 episodes. Figure 2a shows that the time taken for the CMS-based algorithm increases exponentially, whereas there is only a linear increase for the language-based approach, which is expected since the number of RDP states generated also increases linearly with H. In Table 1, we see that ADACT-H with the language family X_{3,1} is faster than FlexFringe in all domains except T-maze(c) (here FlexFringe fails to find the optimal policy, since the heuristics used are not optimized to preserve reward), and outputs smaller automata than both FlexFringe and CMS in all domains except Mini-hall.
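
The Dataset Splits row notes that Algorithm 1, step 1 splits the dataset D into two datasets D1 and D2 of the same size, one for learning RDP states and one for training the policy. The following is a minimal Python sketch of such a split, not the authors' implementation: the function name `split_dataset`, the seeded shuffle, and the choice to drop one episode when D has odd size are all assumptions made for illustration, since the paper excerpt does not say how episodes are assigned to each half.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

Episode = TypeVar("Episode")


def split_dataset(
    episodes: Sequence[Episode], seed: int = 0
) -> Tuple[List[Episode], List[Episode]]:
    """Split D into two disjoint halves D1 and D2 of equal size.

    Hypothetical helper: the shuffle and the handling of odd-sized
    inputs are assumptions, not details taken from the paper.
    """
    shuffled = list(episodes)  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable splits
    half = len(shuffled) // 2
    # With an odd number of episodes, the last one is dropped so that
    # both halves have exactly the same size.
    return shuffled[:half], shuffled[half : 2 * half]


if __name__ == "__main__":
    # Toy episodes: each is a list of (action, observation, reward) steps.
    toy_dataset = [[(0, i, 0.0), (1, i + 1, 1.0)] for i in range(100)]
    d1, d2 = split_dataset(toy_dataset, seed=42)
    print(len(d1), len(d2))
```

Fixing the seed makes the split repeatable across runs, which is one way a reimplementation could document the otherwise unspecified partition.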