Learning a Fast Mixing Exogenous Block MDP using a Single Trajectory
Authors: Alexander Levine, Peter Stone, Amy Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove that STEEL is correct and sample-efficient, and demonstrate STEEL on two toy problems. Code is available at: https://github.com/midi-lab/steel. SIMULATION EXPERIMENTS We test the STEEL algorithm on two toy problems: an infinite-horizon environment inspired by the combination lock environment from Efroni et al. (2022b), and a version of the multi-maze environment from Lamb et al. (2023). ... Results are shown in Table 2. |
| Researcher Affiliation | Collaboration | Alexander Levine (The University of Texas at Austin); Peter Stone (The University of Texas at Austin; Sony AI); Amy Zhang (The University of Texas at Austin) |
| Pseudocode | Yes | A FULL ALGORITHM The STEEL algorithm is presented here in full as Algorithm 1, with a major subroutine, Cycle Find, split out as Algorithm 2. |
| Open Source Code | Yes | Code is available at: https://github.com/midi-lab/steel. |
| Open Datasets | No | We test the STEEL algorithm on two toy problems: an infinite-horizon environment inspired by the combination lock environment from Efroni et al. (2022b), and a version of the multi-maze environment from Lamb et al. (2023). |
| Dataset Splits | No | When assessing the minimum-range minimal loss encoder as in Equation 140, Levine et al. (2024) include an empirical fudge factor: their protocol returns the minimum-range encoder that achieves a loss within 0.1% of the true minimum loss over all possible encoders. We found that even without this fudge factor, the multistep-inverse methods were still heavily biased towards returning ϕSR, when applied either to Single Prime(p, q); or to Double Prime(p, q) with too-small K or too-few samples. ... To better match the spirit of learning an Ex-BMDP from a single trajectory, we collect the fitting and optimization sets from one single trajectory, one after the other, with each corresponding to half of the trajectory: we regard the total sample complexity as the length of the entire trajectory. |
| Hardware Specification | No | The paper describes simulation experiments but does not provide specific hardware details such as GPU/CPU models or other computing infrastructure used for running these simulations. |
| Software Dependencies | No | The paper does not mention specific software dependencies with version numbers, such as programming languages or libraries (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For all experiments, we set δ = ϵ = 0.05. For the combination lock experiments, we set L = 512, and use the (intentionally loose) upper bounds N = D̂ = K + 10 (= \|S\| + 10) and t̂mix = 40. For the multi-maze environment, we use N = D̂ = 80 (> \|S\| = 68), and t̂mix = 300. See Appendix D for how we chose the (loose) bound t̂mix ≥ tmix. For STEEL, we set N = D̂ = 2q (to match the maximum number of states in Double Prime(p, q) or Single Prime(p, q)); t̂mix = 1 (because neither Double Prime(p, q) nor Single Prime(p, q) have time-correlated noise); ϵ = 0.49 (because, due to the simple emission distributions Q of Single Prime and Double Prime, a 51% encoder accuracy on each latent state implies a 100% encoder accuracy); and δ = 0.05. |
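
The per-environment hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is an illustrative reconstruction only: the function and variable names (`steel_config`, `D_hat`, `t_mix_hat`) are hypothetical and not taken from the authors' released code; the numeric values are those quoted from the paper.

```python
def steel_config(env: str, num_states: int) -> dict:
    """Hypothetical builder for the quoted STEEL experiment settings.

    `num_states` is |S|, the number of latent endogenous states.
    Names are illustrative; values follow the paper's reported setup.
    """
    # Shared across the two toy-problem experiments: delta = epsilon = 0.05.
    cfg = {"delta": 0.05, "epsilon": 0.05}
    if env == "combination_lock":
        # Intentionally loose upper bounds: N = D_hat = |S| + 10, t_mix_hat = 40;
        # the paper also sets L = 512 for these experiments.
        cfg.update(N=num_states + 10, D_hat=num_states + 10, t_mix_hat=40, L=512)
    elif env == "multi_maze":
        # N = D_hat = 80 (strictly greater than |S| = 68), t_mix_hat = 300.
        cfg.update(N=80, D_hat=80, t_mix_hat=300)
    else:
        raise ValueError(f"unknown environment: {env}")
    return cfg

maze_cfg = steel_config("multi_maze", num_states=68)
assert maze_cfg["N"] > 68  # the bound exceeds the true latent state count
```

A config for the Single Prime(p, q) / Double Prime(p, q) comparison would differ as quoted in the table (N = D̂ = 2q, t̂mix = 1, ϵ = 0.49, δ = 0.05), since those environments have no time-correlated noise.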