Offline RL in Regular Decision Processes: Sample Efficiency via Language Metrics
Authors: Ahana Deb, Roberto Cipollone, Anders Jonsson, Alessandro Ronca, Mohammad Sadegh Talebi
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we present experimental results to illustrate the properties of our two versions of ADACT-H. We perform experiments in five domains from the literature on POMDPs and RDPs: Corridor (Ronca & De Giacomo, 2021), T-maze(c) (Bakker, 2001), Cookie (Toro Icarte et al., 2019), Cheese (McCallum, 1992) and Mini-hall (Littman et al., 1995), and summarize the results in Table 1. |
| Researcher Affiliation | Collaboration | (1) Universitat Pompeu Fabra; (2) Leonardo S.p.A.; (3) University of Oxford; (4) University of Copenhagen |
| Pseudocode | Yes | For reference, we include the pseudocode of ADACT-H(D, δ) in Appendix A. Cipollone et al. (2023) prove that ADACT-H(D, δ) constructs a minimal RDP R with probability at least 1 − 4AOUδ if D is large enough. Appendix A: Pseudocode of ADACT-H, ADACT-H-A and RegORL. Function ADACT-H(D, δ); Function ADACT-H-A(D, δ, ε, U, C); Algorithm 1: RegORL. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It mentions comparing against an existing algorithm (FlexFringe) but does not provide a link or statement for its own implementation. |
| Open Datasets | Yes | We perform experiments in five domains from the literature on POMDPs and RDPs: Corridor (Ronca & De Giacomo, 2021), T-maze(c) (Bakker, 2001), Cookie (Toro Icarte et al., 2019), Cheese (McCallum, 1992) and Mini-hall (Littman et al., 1995) |
| Dataset Splits | No | The paper mentions splitting the dataset D into two datasets D1 and D2 of the same size for the algorithm's internal use (learning RDP states and training the policy), as described in Algorithm 1, step 1. However, it does not specify explicit training, validation, or test splits for evaluating the model's generalization on held-out data in the experimental section. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions specific algorithms and data structures such as "Count-Min-Sketch (CMS)" and compares its approach to "FlexFringe," but it does not specify version numbers for any software libraries, programming languages, or implementations used in the experimental setup. |
| Experiment Setup | Yes | With the increasing corridor length N (and horizon H N), we plot the time taken and RDP size over 20 runs, with K = 100 episodes. Figure 2a shows that the time taken by the CMS-based algorithm increases exponentially, whereas there is only a linear increase for the language-based approach, which is expected since the number of RDP states generated also increases linearly with H. In Table 1, we see that ADACT-H with the language family X3,1 is faster than FlexFringe in all domains except T-maze(c), where FlexFringe fails to find the optimal policy, since the heuristics used are not optimized to preserve reward, and outputs smaller automata than both FlexFringe and CMS in all domains except Mini-hall. |
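
The Dataset Splits row notes that Algorithm 1 divides the dataset D into two datasets D1 and D2 of the same size, one for learning RDP states and one for training the policy. A minimal sketch of such a split is given below. The function name `split_dataset`, the seeded shuffle, and the choice to drop one episode when the count is odd are illustrative assumptions, not details taken from the paper.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

Episode = TypeVar("Episode")


def split_dataset(
    episodes: Sequence[Episode], seed: int = 0
) -> Tuple[List[Episode], List[Episode]]:
    """Split a dataset of episodes into two disjoint halves of equal size.

    Assumption: episodes are shuffled with a fixed seed before splitting.
    The paper only states that D1 and D2 have the same size.
    """
    shuffled = list(episodes)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    # If the number of episodes is odd, the last one is dropped so that
    # both halves have exactly the same size (an illustrative choice).
    d1 = shuffled[:half]
    d2 = shuffled[half : 2 * half]
    return d1, d2


# Usage: K = 100 placeholder episodes, represented here by integer ids.
D = list(range(100))
D1, D2 = split_dataset(D)
```

In this sketch D1 would feed the state-learning step and D2 the policy-training step, so that the two stages never see the same episodes.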